Friday, July 3, 2009

Parsing Word HTML with Ruby and Rails

I promised I would write about my first Rails project, so here goes.

Within Radio NZ we have a web application for parsing Microsoft Word documents and reformatting them for the web. Content from Word is pasted into a WYSIWYG cell, and submitted to the application. A preview of the parsed content is presented for review.

If this looks OK, then the page is submitted again and an XML document is generated from the content. This XML is sent to our Content Management System to be imported.

If the content is NOT to be imported, then it can be cut from the preview page, and pasted into the CMS WYSIWYG.

The parser ensures that the code is cleaned of extraneous markup, and validates. In the case of pre-formatted documents like our schedules, it can split a week's content into 7 day-parts, format subheadings, add emphasis to times, and add links to programme names. You can see the result here and here.
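As a rough illustration of the schedule handling, here is a minimal sketch in plain Ruby. It is not the production code: the names `split_into_days` and `emphasise_time` are hypothetical, and it assumes each day-part starts with a line beginning with the day name and that each programme entry starts with a time such as 6:00am.

```ruby
# Hypothetical sketch: split a week's schedule into day-parts.
# Assumes the input is already an array of plain-text lines.
DAYS = %w[Monday Tuesday Wednesday Thursday Friday Saturday Sunday].freeze
DAY_LINE = /\A(#{DAYS.join('|')})\b/
LEADING_TIME = /\A(\d{1,2}:\d{2}(?:am|pm)?)/

# Returns a hash of day name => lines belonging to that day.
# Anything before the first day line is dropped.
def split_into_days(lines)
  lines.slice_before { |line| line.match?(DAY_LINE) }
       .select { |chunk| chunk.first.match?(DAY_LINE) }
       .map { |chunk| [chunk.first[DAY_LINE, 1], chunk.drop(1)] }
       .to_h
end

# Wraps a time at the start of a line in <strong> tags.
def emphasise_time(line)
  line.sub(LEADING_TIME, '<strong>\1</strong>')
end

week = ["Schedule for the week", "Monday 6 July", "6:00am News",
        "Tuesday 7 July", "7:00am News"]
days = split_into_days(week)
# days["Monday"] => ["6:00am News"]
# emphasise_time("6:00am News") => "<strong>6:00am</strong> News"
```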

The old application was written in PHP and used a combination of regular expressions, HTML Tidy, and some stream-based parsing to work its magic.

The updated application is written in Ruby on Rails, uses the Sanitize and Hpricot gems, and is much more modular. The reason for the change was to make the app more maintainable: I wanted to add more parsing filters, and the PHP code was a bit of a mess.

I could have refactored the PHP version, but I needed a real project to help me learn Rails, and I suspected it would be less work anyway. Also, having testing built in has advantages when you are writing a parser.

The Rails version has the same basic workflow. Content from Word is pasted into a WYSIWYG (in this case the FCK Editor). The HTML is sent to this bit of code, which cleans out most of the rubbish and does a few RNZ-specific things like standardising time formatting.
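To give a flavour of the time standardisation step, here is a minimal sketch using only a regular expression. The target format (12-hour, colon separator, lower-case am/pm with no space) and the helper name `standardise_times` are assumptions for illustration; the markup cleaning itself is left to Sanitize and is not shown.

```ruby
# Hypothetical sketch: normalise times such as "7.30 PM" or "10:00 a.m"
# to an assumed house style of "7:30pm" / "10:00am".
TIME_PATTERN = /\b(\d{1,2})[.:](\d{2})\s*([AaPp])\.?[Mm]\b/

def standardise_times(html)
  html.gsub(TIME_PATTERN) do
    hour = $1.to_i          # drops any leading zero
    minute = $2
    meridiem = $3.downcase  # "a" or "p"
    "#{hour}:#{minute}#{meridiem}m"
  end
end

standardise_times("<p>News at 7.30 PM and 10:00 am daily</p>")
# => "<p>News at 7:30pm and 10:00am daily</p>"
```

Times that already match the assumed style pass through unchanged, so the filter can safely be run more than once.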

The cleaner adds new lines after certain tags, and the result is passed to a stream-based parser, which walks through the document line by line and processes it based on the document type (set via a drop-down when the content was submitted).
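The two stages can be sketched roughly like this. The list of tags, the "schedule" document type and the method names are all assumptions made for the example, not the real application code.

```ruby
# Hypothetical sketch: break the HTML into lines, then walk it line by line.
# Closing block tags (and <br>) after which a line break is inserted.
BLOCK_END = %r{(</(?:p|h[1-6]|li|ul|ol|table|tr)>|<br\s*/?>)\n?}i
# A paragraph that starts with a day name, e.g. "<p>Monday 6 July</p>".
DAY_HEADING = %r{\A<p>\s*((?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b[^<]*?)\s*</p>\z}i

def add_line_breaks(html)
  html.gsub(BLOCK_END, "\\1\n")
end

# Per-line processing, chosen by document type.
def process_line(line, doc_type)
  case doc_type
  when "schedule"
    heading = line[DAY_HEADING, 1]
    heading ? "<h2>#{heading}</h2>" : line
  else
    line # unknown types pass through untouched
  end
end

def parse_document(html, doc_type)
  add_line_breaks(html)
    .each_line
    .map { |raw| process_line(raw.strip, doc_type) }
    .reject(&:empty?)
    .join("\n")
end

parse_document("<p>Monday 6 July</p><p>6:00am News</p>", "schedule")
# => "<h2>Monday 6 July</h2>\n<p>6:00am News</p>"
```

Putting each block element on its own line first is what makes the simple line-by-line walk possible; each new document type then only needs another branch in `process_line`.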

The new version is in production now, and is more reliable than the old. This is partly because I cleaned up some of the underlying algorithms, fixed some logic bombs, and added a lot more error checking.

One important check added in this version was for smart tags. This is a feature of Word that tags certain words to give them special attributes. The problem is that when pasted and parsed, the tagged words do not appear in the final document. The new parser checks for these and reminds the user to remove them first.
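A check like this could look something like the sketch below. It assumes the smart tags arrive in the pasted HTML as namespaced elements such as `<st1:place>`, and that the check runs on the raw HTML before any cleaning; the method names and the warning text are made up for the example.

```ruby
# Hypothetical sketch: detect Word smart tags in pasted HTML.
# Assumption: they appear as elements with an "stN:" prefix, e.g. <st1:City>.
SMART_TAG = /<\/?st\d+:(\w+)/i

# Returns the distinct smart tag names found, lower-cased.
def smart_tags_in(html)
  html.scan(SMART_TAG).flatten.map(&:downcase).uniq
end

# Returns a reminder for the user, or nil if the content is clean.
def smart_tag_warning(html)
  found = smart_tags_in(html)
  return nil if found.empty?
  "Please remove smart tags in Word before pasting (found: #{found.join(', ')})."
end

pasted = "<p>Live from <st1:City><st1:place>Wellington</st1:place></st1:City></p>"
smart_tags_in(pasted)
# => ["city", "place"]
```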

I really liked the Rails framework. The two best parts were:
  1. Having sensible defaults and behaviours for things. I used to spend most of my time in PHP just configuring stuff and getting things flowing.
  2. The second was the Ruby language (and the Rails extensions to it). Just brilliant. The language design is very good, with consistent interfaces and predictable syntax. It certainly made a nice change from working out which PHP function to use and what order the parameters should be in.

I have also coded in Perl and C (and Assembler), and I like some of the Perlish things that have made their way into Ruby. You can use =~ to compare a string with a /regex/. Cool. Being able to write

when /regex/

inside a

case string

block. Very cool.
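For anyone who has not seen it, here is a small example of both idioms. The `classify` method and its patterns are invented for illustration.

```ruby
# =~ returns the index of the first match, or nil if there is none.
"6:00am News" =~ /News/   # => 7
"6:00am News" =~ /Sport/  # => nil

# case/when with regexes: each pattern is tested against the line in turn.
def classify(line)
  case line
  when /\A\d{1,2}:\d{2}(am|pm)\b/ then :time_entry
  when /\A(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b/ then :day_heading
  when /\A\s*\z/ then :blank
  else :text
  end
end

classify("6:00am News")    # => :time_entry
classify("Monday 6 July")  # => :day_heading
```

This works because `when` uses the === operator, and Regexp#=== is true when the pattern matches the string.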

(I still use a lot of Perl - the Radio NZ audio and text publishing engines are built with Perl).

There are some other Rails projects in the works at Radio NZ - one of them is a search tool based on Solr (an update of the BRAD application that I am working on with Marcus from AbleTech) - so expect some more Ruby and Rails posts in the near future.


nzlemming said...

Are you aware of Matt Holloway's Docvert project?

I appreciate you were using this as a learning experience, but Docvert is full of awesome :-)

Richard Hulse said...

Yes, I was aware of its true awesomeness, and did think of using it.

The main issue was that some of the documents are far from semantically structured. Often the only way to parse them is line by line, taking into account the position in the file and the last occurrence of certain known strings. I also have other content types to parse - HTML from emails and from web pages.

I'll ask Matt about it though, next time I see him.

bkpavan said...

Richard, how do you convert Word documents into HTML or XML?
Is it a manual process of copying into FCK Editor, or are you using any conversion tools?


Richard Hulse said...

Hi Pavan, users just copy from Word and paste into the FCK Editor. The content in FCK is the HTML that we then clean.

I expect that you could upload a document and programmatically extract the HTML, or even use Docvert and process the result of that.