Saturday, April 30, 2011

Rebuilding Radio NZ - Part 4: Content Extraction & Recipes

The next group of posts will deal with the migration of content. In each I’ll show how we were managing the particular content type in Matrix, the design of the content type in ELF, how we migrated the content, and how we manage the content now.

Getting the content out

There were two options for getting the content out of Matrix.

The first was a custom script that we could use for exporting the whole site. The difficulty was defining everything I would need up-front for Squiz to code against - there were many types of content and different requirements, most not known. The other issue was cost - a script to extract just news was estimated to take about 30 hours to write.

The second option was to set up pages that display groups of single assets in XML format. Audio assets are an example: they are self-contained and hold all the data required for importing into ELF. Where this was not possible, screen-scraping would be used. More on this in a future post.
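To give a feel for the XML route, here is a minimal Ruby sketch that reads one exported listing. The element names (assets, asset, title, url, broadcast_at), the sample values, and the parse_assets helper are all assumptions made for the example; they are not the actual Matrix output or the code we used.

```ruby
require 'rexml/document'

# Hypothetical export: element names and values are invented for illustration.
SAMPLE_XML = <<~XML
  <assets>
    <asset id="101">
      <title>Morning interview</title>
      <url>/audio/morning-interview.mp3</url>
      <broadcast_at>2010-05-01T09:05:00+12:00</broadcast_at>
    </asset>
    <asset id="102">
      <title>Afternoon feature</title>
      <url>/audio/afternoon-feature.mp3</url>
      <broadcast_at>2010-05-01T14:10:00+12:00</broadcast_at>
    </asset>
  </assets>
XML

# Turn each <asset> element into a plain hash ready for an import step.
def parse_assets(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//asset').map do |asset|
    {
      id: asset.attributes['id'],
      title: asset.elements['title'].text,
      url: asset.elements['url'].text,
      broadcast_at: asset.elements['broadcast_at'].text
    }
  end
end

ASSETS = parse_assets(SAMPLE_XML)
ASSETS.each { |a| puts "#{a[:id]}: #{a[:title]}" }
```

Because each asset carries everything it needs, the importer can treat every hash independently, which is what makes this route simpler than scraping rendered pages.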

The DIY approach has worked out simpler and cheaper, and I have been able to adjust the export and import to suit each kind of content, building on code from the previous phase.


The recipes section was chosen to go first because the section was completely self-contained apart from some in-bound links from programme pages.

Matrix recipes

At the original launch in 2005 we had high expectations for our recipes section - we wanted to divide content into sections based on ingredients and style of cooking. This proved to be more difficult than expected. At right is what the tree in Matrix looked like.

The recipes home page had seasonal ingredients at the bottom, the right-hand section featured recipes based on the season or special events (Christmas, Thanksgiving, Easter, etc.), and visitors could search or browse by recipe title.

Managing the content was simplified by putting recipes into lettered folders. However, the complex asset structure made it hard to see at a glance which recipes were in which section. Another problem was that the URL structure had an extra (redundant) segment containing the first letter of the recipe.

When tagging was added we tried it, but this required linking every recipe to a pre-named tag and creating new tags on the fly. This would have required a complete reworking of the section, and all in all was too unwieldy to use, even for a section that gets only 3 to 5 new recipes a week.

Recipes took about 5-10 minutes to format, link into the correct folders, and set to a future status that matched the broadcast time.

Pasting recipe content into the WYSIWYG from email and Word documents was patchy. Often the markup would contain code that could not be removed with the built-in code cleaner. We are very fussy about the quality of our markup, so we developed a separate pre-parser to deal with markup issues. The parser has an FCK Editor and a drop-down to select the type of content. It was able to remove extraneous markup and ensure that the XHTML returned was valid. It was also able to do basic formatting for some types of content.

Even a two-step cut and paste process was faster than hand editing code in the Matrix WYSIWYG (or any editor).
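As a rough idea of the kind of cleanup involved, here is a small Ruby sketch that strips typical Word residue from pasted markup. It is not the pre-parser itself: the clean_pasted_markup name and the list of tags and attributes it removes are my assumptions for the example, and regular expressions like these are only a simple pattern match, not a real HTML parse.

```ruby
# Illustrative only: the tag and attribute lists below are assumptions,
# not the rules the real pre-parser applied.
def clean_pasted_markup(html)
  cleaned = html.dup
  # Remove HTML comments, including Word conditional comments.
  cleaned.gsub!(/<!--.*?-->/m, '')
  # Remove Office namespace tags such as <o:p> and </o:p>.
  cleaned.gsub!(%r{</?\w+:\w+[^>]*>}, '')
  # Unwrap presentational <span> and <font> tags, keeping their contents.
  cleaned.gsub!(%r{</?(?:span|font)\b[^>]*>}i, '')
  # Drop class, style and lang attributes (simple pattern match, not a parse).
  cleaned.gsub!(/\s+(?:class|style|lang)="[^"]*"/i, '')
  # Remove paragraphs left holding only whitespace or &nbsp;.
  cleaned.gsub!(%r{<p>(?:\s|&nbsp;)*</p>}i, '')
  cleaned.strip
end

PASTED = '<p class="MsoNormal" style="margin:0"><span lang="EN-NZ">' \
         '2 cups flour<o:p></o:p></span></p><p class="MsoNormal"><o:p>&nbsp;</o:p></p>'
puts clean_pasted_markup(PASTED)
```

The order of the steps matters: the empty-paragraph check runs last so that paragraphs emptied by the earlier steps are also removed.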


Designing recipes was pretty simple. Recipes have a title, body and broadcast date. They have a chef, tags and are broadcast on a particular programme.

In Rails terms (edited for brevity):

has_many :chefs
belongs_to :programme
acts_as_taggable_on :tags

The programme model contained basic information about the programme such as name and webpath, just to get us started.

Having a chef association and tags allows us to provide navigation by tag and by chef. Since adding both features, visitor engagement in that part of the site has increased.

Content Migration

Importing the content was tricky. Each recipe had chef and programme information in the HTML. The import script had to find this information and make the necessary associations.

A rake task was written to parse the content and create recipe assets in ELF. I have posted the code as a gist on GitHub for reference purposes. Note that I was learning Ruby at the time, so it is fairly rough and ready.

As the import script was being written I had it output recipes where it could NOT extract this information. These were found to not be formatted in the standard way, and were edited so that they could be imported.
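The approach can be sketched in a few lines of Ruby. This is not the rake task from the gist: the "Recipe by" and "Broadcast on" credit lines, the sample recipes, and the helper names are hypothetical, chosen only to show the idea of extracting what you can and listing what you cannot.

```ruby
# Hypothetical credit-line formats; the real recipes may have used different markup.
CHEF_PATTERN      = %r{<p>\s*Recipe by\s+([^<]+?)\s*</p>}i
PROGRAMME_PATTERN = %r{<p>\s*Broadcast on\s+<a href="[^"]*">([^<]+)</a>}i

# Return the chef and programme names found in a recipe body (nil for each miss).
def extract_credits(html)
  {
    chef: html[CHEF_PATTERN, 1],
    programme: html[PROGRAMME_PATTERN, 1]
  }
end

# Split recipes into those ready to import and those needing a manual fix.
def partition_recipes(recipes)
  recipes.partition do |recipe|
    extract_credits(recipe[:body]).values.none?(&:nil?)
  end
end

# Invented sample data.
RECIPES = [
  { title: 'Apple Crumble',
    body: '<p>Recipe by Jane Cook</p>' \
          '<p>Broadcast on <a href="/example-programme">Example Programme</a></p>' },
  { title: 'Lemon Tart',
    body: '<p>From Jane Cook, heard on Example Programme</p>' }
]

IMPORTABLE, NEEDS_EDITING = partition_recipes(RECIPES)
NEEDS_EDITING.each { |r| puts "Could NOT extract credits: #{r[:title]}" }
```

Reporting the failures rather than guessing keeps the importer simple: the odd ones out are fixed at the source and the script is run again.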

Tags were manually added to each recipe.

ELF recipes management

In ELF we wanted a data entry screen designed specifically for recipes. This would need to allow for tagging and specifying a chef and broadcast time. And here it is:

The edit screen is simple to use. The tag list offers auto-completion to avoid duplication, and add chef allows new chefs to be added without going to a new screen. A recipe can be added in under 5 minutes.

The WYSIWYG is based on CKEditor. This has powerful built-in routines to clean markup pasted from email and MS Word.

The recipe footers, which contain seasonal ingredients, and the sidebar with special features each have their own manager. This allows the content to be reused and updated each year.

Now that tagging has been simplified, the seasonal ingredient lists (bottom of page) link to relevant tags. The system allows free-tagging, so any new tag is available immediately. Page impressions in the recipes section are double what they were at this time last year, driven by people browsing content by chef and by tag.

An image uploader is built-in, so pictures can be uploaded and added right inside the WYSIWYG.

Handling old URLs

Legacy URLs are passed to the search page, which attempts to extract the title to use as the basis of a search. Try this broken URL for example. In most cases this gives the visitor the recipe they want.
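The title extraction can be sketched like this in Ruby. The URL layout, the example.com address, the extension handling, and the search_terms_from_legacy_url name are assumptions for illustration, not the code running on the site.

```ruby
require 'uri'

# Build search terms from the last path segment of a legacy URL.
# The URL layout and the extension handling are assumptions for illustration.
def search_terms_from_legacy_url(url)
  path = URI.parse(url).path
  slug = path.split('/').reject(&:empty?).last.to_s
  slug = slug.sub(/\.\w+\z/, '')          # drop an extension such as .html
  slug.tr('-_', ' ').squeeze(' ').strip   # separators become spaces
end

puts search_terms_from_legacy_url('http://www.example.com/recipes/a/apple-and-feijoa-crumble')
```

Only the last segment is used, so the redundant first-letter folder in the old URLs is simply ignored.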

The new recipes section was soft-launched last year, and has streamlined the entry of recipes and improved the user experience.

It also gave me some confidence that we were on the right path.

In the next post I'll cover the evolution of our news section from a basic service offering only 20-30 stories at a time, to the current version with categories and sophisticated remote management.
