Bring out your dead!

Yesterday I attended the Digital Preservation Coalition’s ‘Bring out your dead!’ day (subtitled, more soberly, ‘Collaborative approaches to managing file formats – a day of action’; Monty Python reference appreciated, though). The point of the day was to discuss the problems that file formats present to digital preservation (and also to bring along our own problem files). These problems obviously include dealing with very unusual files and files of unknown type, and there are tools, such as DROID, which can help with identifying unknown files. The surprise for me was how much of the discussion focussed on older versions of very well-known file types, particularly Microsoft Office files. Chris Rusbridge, in his opening presentation, talked about his experience of trying to convert his old PowerPoint files, created years ago on an old Mac, to a current format (he found a company that could do it). He also talked about his open letter to Microsoft asking them to publish the specifications for old versions of their file formats. The letter had surprising results: Microsoft were willing to help, but they don’t have the specs.

 

Much of the discussion also centred on the importance of collaboration to solving the file format problem. A key part of this is contributing information to file format registry projects such as PRONOM and CRISP, which suffer from lack of detail in their records (though TNA’s David Clipsham explained that the focus for PRONOM had been on populating it with file signatures, rather than other details). David Clipsham presented a session on how to produce file signatures which, to a non-techy like me, was an eye opener. It was especially interesting to see how easy it is. Essentially, a file signature is a string of binary data which always appears within a file of a particular sort and so which can be used by a tool such as DROID to diagnose the file type. To spot these strings, all one has to do is open a number of examples of the file type in a hexadecimal editor and flick through the open tabs quickly (a bit like a flicker book) to see which bits of the file stay the same (usually it’s the beginning bit). Submitting examples of file signatures to PRONOM helps to build up the registry and increase its usefulness.
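For anyone who would rather script the flicker-book trick than do it by eye, here is a minimal sketch in Python. It is only an illustration of the idea described above, not how DROID or PRONOM work internally: the function name `constant_runs`, the `min_length` parameter and the sample bytes are all made up for the example, and it only compares bytes at the same offset from the start of each file.

```python
from typing import Iterable, List, Tuple


def constant_runs(samples: Iterable[bytes], min_length: int = 2) -> List[Tuple[int, bytes]]:
    """Return (offset, bytes) runs that are identical at the same offset in every sample.

    Hypothetical helper for illustration: it mimics flicking between hex
    editor tabs and noting which bytes never change.
    """
    samples = list(samples)
    if not samples:
        return []

    # Only offsets that exist in every sample can be compared.
    shortest = min(len(s) for s in samples)
    first = samples[0]

    runs: List[Tuple[int, bytes]] = []
    start = None  # offset where the current unchanged run began, if any
    # Iterate one step past the end so a run reaching the end is closed off.
    for i in range(shortest + 1):
        same = i < shortest and all(s[i] == first[i] for s in samples)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                runs.append((start, first[start:i]))
            start = None
    return runs


if __name__ == "__main__":
    # Made-up sample data standing in for the contents of three files
    # of the same (imaginary) format.
    demo = [
        b"\x89DEMO\x00\x01alpha-END",
        b"\x89DEMO\x00\x02bravo-END",
        b"\x89DEMO\x00\x03delta-END",
    ]
    for offset, run in constant_runs(demo):
        print(offset, run.hex(" "))
```

With real files you would read each one with `open(path, "rb").read()` and pass the results in. Because the comparison is by offset from the start, an unchanging chunk at the end of files of different lengths would be missed, and a handful of samples can agree by coincidence, so any candidate it finds is a starting point for checking against more examples rather than a finished signature.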

 

Other ways that we need to collaborate as a sector include better coordination between projects and coordinated lobbying of software companies (and government), of the sort pioneered by Chris Rusbridge. There are several projects which have attempted to create file format registries and the consensus seemed to be that it isn’t necessarily a bad thing to have more than one registry, but that data needs to be shared systematically (preferably automatically) between them. The same body of data shared in several places isn’t bad, but data split between several places is. In the lively final discussion a big topic was the need for someone to take a strong leadership or coordination role in carrying forward the lobbying of software companies to release file specifications. Chris Rusbridge rather theatrically put William Kilbride of the DPC on the spot and William deftly turned this round as a question for the DPC’s members as to whether the DPC was the organisation to take a lead in this.

 

If I have a criticism of the day it was that it was billed (unless I got the wrong end of the stick, which commonly happens) as a workshop for dealing with delegates’ problem files (the ‘dead’ of ‘Bring out your dead’). The discussions and workshops were interesting, presenting a number of tools for characterising files and creating signatures. However, it was less hands-on than I expected. I came down with a slightly different problem to that of unknown/obsolete file formats. We have some TIF files which create problems when converted to JP2 and streamed, and we don’t know why. I didn’t get any answer from the workshop itself, but I was at least able to discuss it with other people during the break and find a possible source of help. Which was useful!

1 Comment »

  1. Interesting to hear about the file signature extraction tool DROID and the file registries PRONOM and CRISP; I wasn’t aware of any of them. Similarly, Apache Tika has a detector for magic bytes (i.e. type-specific patterns near the beginning of the document input stream). I like the flicker book analogy 🙂

    Shame that certain JP2 files are a problem with IIPImage server. Do you think it is a problem in the encoding from TIFF to JP2? Have you tried using a different conversion tool to convert the same TIFF? This would help identify whether the conversion tool (or its configuration) is to blame.

    Either way, it would be best to complete the investigation started on the IIPImage forums: http://goo.gl/WxKCA
