Friday, August 29, 2008

Opening up old content and more Ogg Vorbis

Just today we completed work on the Saturday Morning with Kim Hill programme archive.

This opens up all the content that has been broadcast on the show since the start of the year.

All audio can be downloaded in MP3 format, and in Ogg Vorbis since August. The programme archive page lists a summary of all programmes, and you can also search audio and text.

Morning Report and Nine To Noon now have Ogg Vorbis, and I expect to be able to open up their older content in the next few weeks.

Other programmes will have Ogg Vorbis added as I have time.

I have a test RSS feed (the URL might change) with Ogg enclosures for Saturday Morning. Send any feedback to webmaster at radionz dot co dot nz.

Wednesday, August 27, 2008

Creating a Pseudo Daemon Using Perl

One of the trickier software jobs I've worked on is a Perl script that runs almost continuously.

The script was needed to check a folder for new content (in the form of news stories) and process them.

The stories are dropped into a folder as a group via FTP, and the name of each story is written into a separate file. The order in this file is the order the stories need to appear on the website.

The old version of the script was run by cron every minute, but there were three problems.

The first was that you might have to wait a whole minute for content to be processed, which is not really ideal for a news service.

The second was that, should the script be delayed for some reason, it was possible to end up with a race condition, with two (or more) scripts trying to process the same content.

The third was that the script could start reading the order file before all the files had been uploaded.

In practice 2 and 3 were very rare, but very disruptive when they did occur. The code needed to avoid both.

The new script is still run once per minute via cron, but contains a loop which allows it to check for content every three seconds.

It works this way:

1. When the script starts it grabs the current time and tries to obtain an exclusive write lock on a lock file.
2. If it gets the lock it starts the processing loop.
3. When the order file is found, the script waits for 10 seconds. This is to allow any current upload process to complete.
4. It then reads the order file, deletes it, and starts processing each of the story files (this part of the job is sketched just after the list).
5. These are all written out to XML (which is imported into the CMS), and the original files are deleted.
6. When this is done, the script continues to look for files until 55 seconds have elapsed since it started.
7. When no time is left it exits.
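
The loop code below only covers the timing and locking; the part that actually handles the order file and story files (steps 3 to 5) is done inside a separate routine that isn't reproduced in this post. Purely as a hypothetical sketch of that part - the order-file name and the write_story_as_xml() helper are invented here:

# Hypothetical sketch only - the real routine is not shown in this post,
# and the order-file name and write_story_as_xml() are invented.
sub process_order_file {
    my ($dir) = @_;

    my $order_file = "$dir/order.txt";   # assumed name for the order file
    return unless -e $order_file;

    sleep 10;                            # step 3: let any in-progress upload finish

    open my $fh, '<', $order_file or die "could not read $order_file: $!";
    chomp( my @stories = <$fh> );
    close $fh;
    unlink $order_file;                  # step 4: read the order file, then delete it

    for my $story (@stories) {
        my $path = "$dir/$story";
        next unless -e $path;
        write_story_as_xml($path);       # step 5: write XML for the CMS import (stand-in call)
        unlink $path;                    # and remove the original story file
    }
}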

This is what the loop code looks like:

use Fcntl ':flock';   # provides LOCK_EX (loaded at the top of the full script)

my $stop_time = time + 55;

my $loop         = 1;
my $locking_loop = 1;
my $has_run      = 0;

while( $locking_loop ){
    # first see if there is a lock file and
    # wait till the other process is done if there is
    if( open my $LOCK, '>>', 'inews.lock' ){
        flock($LOCK, LOCK_EX) or die "could not lock the file";
        print "** lock gained " . localtime(time) . " running jobs: " if ($debug);

        while( $loop ){
            my $job_count = keys(%jobs);
            for my $job (1..$job_count){
                # run the next job if we are within the time limit
                if( time < $stop_time ){
                    $has_run ++;
                    print "$job ";
                    process( $job );
                    sleep(3);
                }
                else{
                    $loop         = 0;
                    $locking_loop = 0;
                }
            }
        }
        close $LOCK;
        print "- lock released\n";
    }
    else{
        print "Could not open lock file for writing\n";
        $locking_loop = 0;   # nothing happens here
    }
    print localtime(time) . " No jobs processed\n" unless $has_run;
}

This is the output of the script (each number is the name of a job; we have three jobs, one for each FTP directory):

** lock gained Tue Jul 22 08:17:00 2008 running jobs: 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 - lock released
** lock gained Tue Jul 22 08:18:00 2008 running jobs: 1 2 3 1 2 3 1 2 3 1

24 news files to process
* Auditor-General awaits result of complaint about $100,000 donation
* NZ seen as back door entry to Australia
.
snip
.
- lock released
** lock gained Tue Jul 22 08:19:23 2008 running jobs: 1 2 3 1 2 3 1 2 3 1 2 - lock released


You can see that the script found some content part way through its cycle, and ran over the allotted time, so the next run of the script did not get a full minute to run.

This ensures that there is never a race between two scripts. You can see what happens when a job takes more than 2 minutes:

- lock released
** lock gained Tue Jul 22 10:46:32 2008 running jobs: - lock released
Tue Jul 22 10:46:32 2008 No jobs processed
** lock gained Tue Jul 22 10:46:32 2008 running jobs: - lock released
Tue Jul 22 10:46:32 2008 No jobs processed
** lock gained Tue Jul 22 10:46:32 2008 running jobs: - lock released
Tue Jul 22 10:46:32 2008 No jobs processed
** lock gained Tue Jul 22 10:46:32 2008 running jobs: 1 2 3 1 2 3 1 2


All the scripts that were piling up waiting to get the lock exited immediately once they found that their time was up.

I am using the same looping scheme to process a queue elsewhere in our publishing system and I'll explain this in my next post.

Friday, August 15, 2008

Using Git to manage rollback on dynamic websites

Page rollback is useful for archival and legal reasons - you can go back and see a page's contents at any point in time. It is also a life-saver if some important content gets accidentally updated - historical content is just a cut and paste away. The MediaWiki software that runs Wikipedia is a good example of a system with rollback.

There are several methods available to a programmer wanting to enable a rollback feature on a Content Management System.

The simplest way to do this is to store a copy of every version of the saved page, and maintain a list of pointers to the most recent pages. It would also be possible to store diffs in the database - old versions are saved as a series of diffs against the current live page.
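
Purely as an illustration of the first approach (this is not code from our CMS, and I'm assuming DBD::SQLite just for brevity - the table and column names are invented), every save would simply insert a full copy of the page:

# Hypothetical illustration of the "store every version" approach.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=rollback.db', '', '', { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS page_versions (
        id       INTEGER PRIMARY KEY,
        path     TEXT    NOT NULL,   -- URL path of the page
        content  TEXT    NOT NULL,   -- full copy of the page as saved
        saved_at INTEGER NOT NULL    -- unix timestamp of the save
    )
});

# every save inserts a new row; the current page is just the newest row for a path
my $html = '<h1>Example page</h1>';
$dbh->do('INSERT INTO page_versions (path, content, saved_at) VALUES (?, ?, ?)',
         undef, '/news/example', $html, time);

# rollback (or just viewing): fetch the most recent version at or before a point in time
my ($old) = $dbh->selectrow_array(
    'SELECT content FROM page_versions
      WHERE path = ? AND saved_at <= ?
      ORDER BY saved_at DESC LIMIT 1',
    undef, '/news/example', time);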

A useful feature would be the ability to view a snapshot of the entire site at any point in time. This is probably of greatest interest to state-owned companies and Government departments who need to comply with legislation like New Zealand's Public Records Act.

A database-based approach would be resource intensive - you'd have to fetch all the required content, and then there is the challenge of displaying an alternative version of the site to one viewer while others are still using the latest version.

I was thinking of alternatives to the above, and wondered if a revision control system might be a more efficient way to capture the history, and to allow viewing of the whole site at points in time.

Potentially this scheme could be used with any CMS, so I thought that I'd document it in case someone finds it useful.

Git has several features that might help us out:

Cryptographic authentication

According to the Git site: "The Git history is stored in such a way that the name of a particular revision (a "commit" in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. Also, tags can be cryptographically signed."

This makes Git a great choice from a legal perspective.

Small Storage Requirements

Again, from the site: "It also uses an extremely efficient packed format for long-term revision storage that currently tops any other open source version control system".

Josh Carter has some comparisons and so does Pieter de Biebe.

Overall it looks as though Git does the best job of storing content, and because it is only storing changes it'll be more efficient than saving each revision of a page in a database. (Assuming that is how it is done.)

And of course Git is fast, although we are only using commit and checkout in this system.

Wiring it Up

To use Git with a dynamic CMS, there would need to be a save hook that does the following.

When a page is saved:

1. Get the content from the page being saved and the URL path.

2. Save the content and path (or paths) to a queue.

3. The queue manager would take items off the queue, write them to the file system and commit them.

These commits are on top of an initial save of the whole site, whatever that state may be. The CMS would need a feature that outputs the current site, or perhaps a web crawler could be used.
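
A minimal sketch of steps 2 and 3 might look like this. It assumes IPC::DirQueue for the queue and shells out to the git command line; the directory names and the queued data format (URL path on the first line, page content after it) are just assumptions made for the example:

# Hypothetical sketch - directory names and the queued data format are invented.
use strict;
use warnings;
use IPC::DirQueue;
use File::Basename qw(dirname);
use File::Path qw(mkpath);

my $work_tree = '/var/www/site-history';   # a Git work tree holding the rendered site
my $dq = IPC::DirQueue->new({ dir => '/var/spool/cms-pages' });

# step 2 (called from the CMS save hook): queue the URL path and the page content
sub queue_page {
    my ($url_path, $content) = @_;
    $dq->enqueue_string("$url_path\n$content");
}

# step 3 (the queue manager): write each item into the work tree and commit it
while ( my $job = $dq->pickup_queued_job() ) {
    open my $fh, '<', $job->get_data_path() or die "cannot read queued job: $!";
    chomp( my $url_path = <$fh> );
    my $content = do { local $/; <$fh> };
    close $fh;

    my $file = $work_tree . $url_path;
    mkpath( dirname($file) );
    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} $content;
    close $out;

    chdir $work_tree or die "cannot chdir to $work_tree: $!";
    system( 'git', 'add', $file ) == 0 or die "git add failed";
    system( 'git', 'commit', '-m', "CMS save: $url_path" );   # non-zero exit if nothing changed; ignored here
    $job->finish();   # remove the item from the queue
}

Shelling out keeps the sketch simple; something like Git::Wrapper from CPAN could be used instead, and in a real system the commit message would probably carry the CMS user and asset id as well.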

To view a page in the past, it is a simple matter to checkout the commit you want and view the site as static pages via a web server. Because every commit is built on top of previous changes, the whole site is available as it was when a particular change was made.

The purpose of the queue manager is to allow commits to be suspended so that you can check out an old page, or for maintenance. Git gc could be run each night via cron while commits were suspended. I'd probably use IPC::DirQueue because it is fast, stable, and allows multiple processes to add items to the queue (and take them off), so there won't be any locking or race issues.

Where the CMS is only managing simple content - that is, there are no complex relationships between assets such as nesting, or sharing of content - this scheme would probably work quite well.

There are problems, though, when content is nested (embedded) or shared with other pages, or is part of an asset listing (a page that displays the content of other items in the CMS).

If an asset is nested inside another asset the CMS would need to know about this relationship. If the nested asset is saved then any assets it appears inside need to be committed too, otherwise the state of content in Git will not reflect what was on the site.

I'd expect that a linked tree of content usage would be needed to manage these intra-page relationships and provide the information about which pages need to be committed.
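
Purely as an illustration (none of this is a real CMS API - the usage map and names are invented), the lookup could boil down to walking from a saved asset up to every page that embeds it, directly or through other assets:

# Hypothetical sketch - the usage map and asset/page names are invented.
my %used_in = (
    'asset:quote-123' => [ 'asset:sidebar-7' ],                  # a nested asset
    'asset:sidebar-7' => [ 'page:/news/index', 'page:/about' ],  # pages that embed it
);

sub pages_to_commit {
    my ($asset_id, $seen) = @_;
    $seen ||= {};
    return () if $seen->{$asset_id}++;    # guard against circular nesting

    my @pages;
    for my $parent ( @{ $used_in{$asset_id} || [] } ) {
        if ( $parent =~ /^page:/ ) {
            push @pages, $parent;                            # this page must be committed too
        }
        else {
            push @pages, pages_to_commit($parent, $seen);    # keep walking up through nested assets
        }
    }
    return @pages;
}

# saving asset:quote-123 means /news/index and /about also need to be committed
my @affected = pages_to_commit('asset:quote-123');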

This is all theoretical, but feel free to post any comments to extend the discussion.

Saturday, August 9, 2008

Setting up Ogg Vorbis Coding for MP2 Audio Files

Today on Radio New Zealand National, Kim Hill interviewed Richard Stallman on the Saturday Morning programme. Prior to the interview Richard requested that the audio be made available in Ogg Vorbis format, in addition to whatever other formats we use.

At the moment our publishing system generates audio in Windows Media and MP3 formats, so I had two options: generate the Ogg files by hand on the day, or add an Ogg Vorbis module to do it automatically.

Since the publishing system is based on free software (Perl), it was a simple matter to add a new function to our existing custom transcoder module. It also avoided the task of manually coding and uploading each file. Here is the function:

sub VORBIS()
{
    my( $self ) = shift;
    my( $type ) = TR_256;

    my $inputFile = $self->{'inputFile'};
    my $basename  = $self->{'basename'};
    my $path      = $self->{'outputPath'};
    my $ext       = '.ogg';

    my $output_file = "$path\\$basename$ext";

    my $lame_decoding_params = " --mp2input --decode $inputFile - ";
    my $ogg_encoding_params  = qq{ - --bitrate 128 --downmix --quality 2 --title="$self->{'title'}" --artist="Radio New Zealand" --output="$output_file" };

    my $command = "lame $lame_decoding_params | oggenc2 $ogg_encoding_params";

    print "$command \n" if ($self->{'debug'});
    my $R = $rates{$type};
    &RunJob( $self, $command, "m$rates{$type}" );

    # we must return a code so the caller knows what to do
    return 1;
}

All the Ogg parameters are hard-coded at this stage, and I'll add code to allow for different rates later, once I have done some tests to see what works best for our purposes.

Once this code was in place, the only other change was to update the master programme data file to switch on the new format.
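
The programme data file itself isn't shown in this post, so the following is only a hypothetical illustration of the kind of change involved - the keys and values are invented:

# Hypothetical entry only - the real programme data format is not shown in this post.
my %programmes = (
    saturday => {
        title   => 'Saturday Morning',
        formats => [ 'MP3', 'WMA', 'VORBIS' ],   # VORBIS newly switched on
    },
);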

After that, any audio published to the Saturday programme would automatically get the new file format, and this would also be uploaded to our servers.

The system was originally built with this in mind - add a new type, update the programme data, done - but this is the first time I have used it.

You can see the result on today's Saturday page.