Friday, August 15, 2008

Using Git to manage rollback on dynamic websites

Page rollback is useful for archival and legal reasons - you can go back and see a page's contents at any point in time. It is also a life-saver if some important content gets accidentally updated - historical content is just a cut and paste away. The MediaWiki software the runs Wikipedia is a good example of a system with rollback.

There are several methods available to a programmer wanting to enable a rollback feature on a Content Management System.

The simplest way to do this is store a copy of every version of the saved page, and maintain a list of pointers to the most recent pages. It would also be possible to store diffs in the database - old versions are saved as a series of diffs against the current live page.

A useful feature would be the ability to view a snapshot of the entire site at any point in time. This is probably of greatest interest to state-owned companies and Government departments who need to comply with legislation like New Zealand's Public Records Act.

A database-based approach would be resource intensive - you'd have to get all the required content, and then there is the challenge to display an alternative version of the site to one viewer while others are still using the latest version.

I was thinking of alternatives to the above, and wondered if a revision control system might be a more efficient way to capture the history, and to allow viewing of the whole site at points in time.

Potentially this scheme could be used with any CMS, so I thought that I'd document it in case someone finds it useful.

Git has several features that might help us out:

Cryptographic authentication

According to the Git site: "The Git history is stored in such a way that the name of a particular revision (a "commit" in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. Also, tags can be cryptographically signed."

This is a great choice from a legal perspective.

Small Storage Requirements

Again, from the site: "It also uses an extremely efficient packed format for long-term revision storage that currently tops any other open source version control system".

Josh Carter has some comparisons and so does Pieter de Biebe.

Overall it looks as though Git does the best job of storing content, and because it is only storing changes it'll be more efficient that saving each revision of a page in a database. (Assuming that is how it is done.)

And of course Git is fast, although we are only using commit and checkout in this system.

Wiring it Up

To use Git with a dynamic CMS, there would need to a save hook that did the following.

When a page is saved:

1. Get the content from the page being saved and the URL path.

2. Save the content and path (or paths) to a queue.

3. The queue manager would take items off the queue, write them to the file system and commit them.

These commits are on top of an initial save of the whole site, whatever that state may be. The CMS would need a feature that outputs the current site, or perhaps a web crawler could be used.

To view a page in the past, it is a simple matter to checkout the commit you want and view the site as static pages via a web server. Because every commit is built on top of previous changes, the whole site is available as it was when a particular change was made.

The purpose of the queue manager is to allow commits to be suspended so that you can checkout an old page, or for maintenance. Git gc could be run each night via cron while commits were suspended. I'd probably use IPC::DirQueue because it is fast, stable, and allows multiple processes to add items to the queue (and take them off), so there won't be any locking or race issues.

Where the CMS is only managing simple content - that is, there are no complex relationships between assets such as nesting, or sharing of content - this scheme would probably work quite well.

There are problems though when content is nested (embedded) or shared with other pages, or is part of an asset listing (a page that display the content of other items in the CMS).

If an asset is nested inside another asset the CMS would need to know about this relationship. If the nested asset is saved then any assets it appears inside need to be committed too, otherwise the state of content in Git will not reflect what was on the site.

I'd expect a linked tree of content use would be implemented to manage intra-page relationships and provide the information about which pages need to be committed.

This is all theoretical, but feel free to post any comments to extend the discussion.

No comments: