Sunday, July 10, 2011

Rebuilding Radio NZ - Part 12: Migrating Episodes

The migration of programme episode content from MySource Matrix into our new Rails-based CMS has proven to be the largest and most complex task. This week I am going to dive heavily into the code we used to do this.

For most programmes we maintain a library of content dating back to the start of 2008. Some go back further. There are approximately 10,000 episodes from dozens of programmes, each with links, images and embedded video. The images were all stored in Matrix and also had to be transferred.

Because Matrix pages are only assembled when they are requested publicly, the most practical option was to download each page and scrape the content. We have been very consistent in the HTML mark-up used on the site, so we expected scraping to be reasonably reliable.

The first task was to get an inventory of URLs for each programme. A manifest file was created for each programme that gave a list of all the unique episode URLs for that programme, and optionally (if it existed) the episode summary from the metadata. I was not able to rely on programme schedules to get this information because some programmes run specials (Morning Report and Checkpoint) and others were cancelled due to Civil Defence events.

In Matrix I used one Asset Listing for each programme to generate a manifest in XML, and wrote a script to read each of these and cache them locally. Once available, a second script requested every URL listed in the manifests and cached these as well.

Caching the files locally allowed much faster test runs, eliminated unnecessary load on the server and also allowed me to quickly make small tweaks to the HTML if required.
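The caching step can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the helper name cached_fetch, the one-file-per-URL layout and the SHA-1 file naming are all assumptions made for the example, and the downloader is passed in as a callable so the sketch does not depend on the network.

```ruby
require 'digest/sha1'
require 'fileutils'

# Returns the cached body for +url+, calling +fetcher+ only on a cache miss.
# Assumption: one cache file per URL, named after the SHA-1 digest of the URL.
def cached_fetch(url, cache_dir, fetcher)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, "#{Digest::SHA1.hexdigest(url)}.html")

  # Cache hit: serve the local copy (which may have been hand-edited).
  return File.read(path) if File.exist?(path)

  # Cache miss: download once and keep the result for later runs.
  body = fetcher.call(url)
  File.write(path, body)
  body
end
```

In a real run the fetcher could be something like a lambda wrapping Net::HTTP.get. Because later runs read only the local file, a small fix to a cached page's HTML survives repeated test imports.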

The next step was to extract the core content from the page. I used Nokogiri as my tool of choice for this task.

All body content on the site sits within a div with id #cont-pri. After extracting this block from the page, unwanted elements were removed from the DOM.

Audio was removed completely, as this is linked to episodes by association, and rendered dynamically by ELF. Other content to be added later (such as promotions of future content) was also removed.

Host information (in a paragraph with the host class) was extracted, and the spelling of some names was corrected. For each programme I ran the script with debugging code in place to list all episodes where the host could NOT be extracted. Once identified I updated the cache files, or altered the import routine to allow for the variation.

During the actual import the audio and relevant presenter were associated with each episode and saved.

Images

The import script also iterated over all images in the content, cached them locally, then uploaded them into ELF, and changed the link in the HTML to point to the new image.

The import routine was designed to run on the same content repeatedly without creating duplicates. This made it simple to re-parse and import the content again if something incorrect was found later.
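The idea behind a repeatable import is to look each episode up by a stable key and update it in place rather than always creating a new record. The sketch below uses an in-memory hash keyed by source URL as a stand-in for the database. The class name and the choice of URL as the key are assumptions for illustration.

```ruby
# In-memory stand-in for the episodes table, keyed by source URL.
# In a Rails app the lookup would go through the model instead.
class EpisodeStore
  def initialize
    @episodes = {}
  end

  # Creates the episode the first time +url+ is seen and updates it on
  # every later run, so repeated imports never create duplicates.
  def import(url, attrs)
    episode = (@episodes[url] ||= {})
    episode.merge!(attrs)
    episode
  end

  def [](url)
    @episodes[url]
  end

  def count
    @episodes.size
  end
end
```

Importing the same URL twice leaves one episode holding the most recent attributes, which is what makes a corrected re-import safe.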

As it happens, we did find a couple of markup issues after launch and re-ran the importer on our live system, under normal load with people browsing the content.

I have posted the code on github. I make no excuses for the ugliness or levels of hackery. This code is of the get-it-going-fast-run-it-once-and-throw-it-away-while-also-learning-ruby variety. Don't expect too much.

The clean_up_episode_html function is of interest, as this is where the captured HTML is cleaned up. A series of regular expressions is used to find and replace mark-up generated by the WYSIWYG editor in Matrix that is not what we'd ideally want.

The HTML is then sent to HTML Tidy and a smaller set of regexes is used to fine-tune the code.

This function was fine-tuned by running episodes through it and watching the output for anomalies. Existing regexes were adjusted, or new ones added, to clean the output.
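To give a flavour of the regex pass, here is a small sketch of an ordered list of find-and-replace rules applied one after another. The three rules and the name clean_html are invented for illustration and are not taken from clean_up_episode_html; the Tidy step is left out.

```ruby
# Ordered [pattern, replacement] pairs. These rules are illustrative
# examples of typical WYSIWYG clean-ups, not the real rule set.
CLEANUP_RULES = [
  [/<p>(?:\s|&nbsp;)*<\/p>/i, ''],   # drop empty paragraphs
  [/\s+style="[^"]*"/i, ''],         # strip inline styles
  [/<(\/?)b>/i, '<\1strong>']        # <b>...</b> becomes <strong>...</strong>
].freeze

# Applies every rule in order and returns the cleaned HTML.
def clean_html(html)
  CLEANUP_RULES.reduce(html) do |text, (pattern, replacement)|
    text.gsub(pattern, replacement)
  end
end
```

Keeping the rules in one list makes the tuning loop described above cheap: adjusting a pattern or appending a new pair is a one-line change, followed by another run over the cached episodes.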

Now we have most regular Radio NZ programmes running in ELF: Morning Report, Nine To Noon, Midday Report, Afternoons, Nights, Saturday, This Way Up, Sunday and Arts on Sunday.

Many of Radio NZ Concert's programmes are in ELF too: Upbeat, Music Alive, and many others.

Now that most of the hard transfer work is over, I am focussing on building out functionality to support programmes that don't fit existing patterns. Examples are Enzology and New Horizons.

I am also working on improving the administration section - more programmes means more users and this is generating good ideas for improvement.

I'll be posting only fortnightly on this topic from now on, as I have nearly caught up with the work currently being done on ELF.

3 comments:

Dave Lane said...

Richard, these blog entries are a great resource. It's cool to see what you and RadioNZ are achieving with these great FOSS technologies. I'm a frequent listener to RNZ Podcasts, always selecting the OGG version where available. I've had a bit of a frustration, however, because it doesn't appear that the site is currently setting a MIME type header for the OGG files - they're coming up as "type: unknown" for me in Firefox (meaning I always have to manually select an app to play them, which is a bit of a hassle). The MIME type for purely audio podcasts should be "audio/ogg". Is that something that's easily remedied? Seems that the mp3 files do have a valid MIME type...

Many thanks,

Dave Lane (@lightweight)

Richard Hulse said...

Thanks for your comments Dave. Did you see we extended the number of Ogg RSS feeds recently:
http://www.radionz.co.nz/oggcasts

I will get our CDN guys to fix the mime type.

Richard Hulse said...

This should be fixed now for new content. Old content will update when the CDN caches expire and the content is updated.

FYI, the OGG content type was not set on our origin (master) server for the CDN. Let me know (via webmaster at radionz dot co dot nz) if you have any further issues.