The original essay was published in February 1999.

ESSAY (February 1999) [Printed in "Reality Module No.8."]

All The Dead Data

Introduction

In Library School in 1988 I heard of spools of 2-inch magnetic tape sitting on shelves in a warehouse somewhere in the United States, containing meteorological data collected in the 1960s. Sitting there - untouched, gathering dust, slowly deteriorating. The old computing machinery which recorded the tapes has been superseded and has quietly, without fuss, been trashed. Furthermore, whatever written records once existed describing the format in which the information was recorded on the tapes have also been lost, misplaced or thrown away. Even if somehow we could assemble an old machine and read the information on these tapes before it fades away, the tape binder cracks, the ferrous-oxide flakes away, or print-through corrupts the data - we will get a sequence of ones and zeroes meaningless to us. 30 years after the event - we have lost the key to this data. There are many issues here.

Hardware Considerations

In this digital age we constantly invent the new, and discard the old. In this orgy of creation and recreation there are many casualties. We can all name many of them - the old boxes of punched cards and the punch interface, generations of reel-to-reel magnetic tapes like in so many old SF movies, the old washing-machine sized 5MB Winchester hard drives, generations of relays and valves and magnetic core memories and transistors and integrated circuits and superseded CPUs, 8-inch floppy drives, 5 1/4-inch floppy drives, tape cartridges, digital tape drives, magneto-optical drives, WORM drives, the now threatened 3 1/2-inch floppy disk, and its would-be successors the ZIP-drive, the superfloppy, the Supra-drive, and for mass storage CD-ROMs and the new DVD-ROM drives.
All these storage media, all these reading/writing/access devices are gone or under threat; even standards for peripheral access (RS-232 serial, parallel ports, IDE & SCSI) are being threatened by newcomers like the Universal Serial Bus (USB) and FireWire. It is no use leaving your 3 1/2-inch floppies or your CD-ROMs to your grandchildren. They will not have a device to read them. (Issues of media longevity pale into insignificance when you realise that the medium will be obsolete long before the data begins to decay.) This is the first issue - we no longer have ready access to the hardware needed to access our old data.

Software Considerations

There is another twist. James Gleick in his article "The Digital Attic" (2) quotes Stewart Brand (Creator of 'The Whole Earth Catalog'):

"Paper at least degrades gracefully," says Brand nostalgically, "Digital files are utterly brittle; they're complexly immersed in a temporary collusion of a certain version of a certain application running on a certain version of a certain operating system in a certain version of a certain box, and kept on a certain passing medium such as a 5 1/4 inch floppy."

Gleick continues himself: 'If a company has digital business records a mere decade old, what are the chances that it has stored a vintage 1988 personal computer, DOS 2.1, and the correct version of Lotus 1-2-3?'

(The anecdote is well known, and I think I have the facts reasonably accurate: when the editors of "The Encyclopedia of Science Fiction" wanted to produce their second edition, they found that the text of their first edition was stored on an 8-inch floppy in a word-processor format no-one used any more, which ran on a Word Processor which was no longer being made.)

Sure - modern software proprietors recognise this problem and provide their software packages with an assortment of import and export filters - but even these have shortcomings.
Two examples from my own experience:

While I was working at VicRoads (Oct.'95 to Aug.'96) we upgraded from WordPerfect 5.1 to WordPerfect 6.0. I found that when I imported my old documents into the new version of the program, the font sizes and some of the formatting were stuffed up. (This was from and to the publisher's own software products: a situation where I'd expect the conversion to be flawless.)

Now - this old word-processor [WordPerfect for the Amiga v.4.1.12] can save files in WordPerfect for IBM v.4.2 format. In 1995 I did a Professional Career Development Course at a company called BDR. One of our tasks was preparing a new professional Resumé using Microsoft Word. I used CrossDos at home and saved my existing resumé as a WordPerfect IBM v.4.2 document on an IBM-formatted floppy and took the disk to the course venue. I found out to my considerable irritation that Word does not import WordPerfect v.4.2 documents (but handles v.5.0, v.5.1 & 6.0). The Microsoft programmers must have decided that no-one in 1995 would still be using a version of the program that ancient. (Though they still have a WordStar filter - a program I remember using at the Australian Bureau of Statistics back in 1986.) I don't know whether Microsoft user-support will send you (for a price) additional import/export filters for Word if you request them. (Somehow I can't see them being that organised or caring. We ALL KNOW that ALL SENSIBLE AND RIGHT-THINKING PEOPLE line up outside Harvey Norman's at midnight to ensure they get the latest version of a Microsoft program as soon as it's released!)

In short, import/export filters are an imperfect solution. A program cannot be guaranteed to import - glitch-free - files created by older versions of itself, let alone successfully import and convert files created by a competitor's program (whose format has had to be, I expect, reverse-engineered from a secret original). This brings me to the next topic.
The Data Cage

Steve Sergeant (3) expresses it thus:

"We cannot use an alternative product to access our data because software vendors consider formats for storing data to be trade secrets. In effect, software vendors hold our data hostage. They ensure future purchases of their software by leaving us no other way to access or manipulate our data."

[Unlike Steve I don't see this as maliciousness on the part of software proprietors. It is an historical accident. They had to create data formats because no suitable ones existed back then [would you want to save your letters in ASCII format?], and once they had invested time and money in inventing a format they had a right to keep it secret to prevent johnny-come-latelies from making easy profits at their expense from all their expensive R&D.]

But still - the software proprietors have designed the cage in which our data sits. They control our access to our own files.

However the problem is bigger than that. At Footscray Public Library in 1997 I was cataloguing a stack of CD-ROM software, and I found myself wondering what the effective lifetime of these products was. They were all designed to operate under two generations of the Windows operating system (v.3.1 or Windows 95), and I was acutely aware of how temporally-restricted that operating-system paradigm was. Sure, Windows 3.1 software can be made to work (with varying degrees of difficulty) under Windows 95 and Windows 98, and I expect Windows 95 software will be usable under Windows 2000 (whenever it is released). But as time passes - the operating system grows and continues to change, and software of an earlier age becomes less-and-less compatible.

So what is the lifetime of your software? Will Encarta98 work on the PC you will be using in 2008? Unless you are using the same machine with the same version of the operating system in 2008 as you are now - I'd say a definite "No!"
Your datafiles are only compatible with the current version of the application program and the current version of the operating system for a short time (maybe 2-3 years if you are lucky). You can 're-cage' them more or less successfully every time you upgrade your application - but this is a lot of hard work (converting every file you have ever made with any version of that application), and clearly unacceptable.

And applications and operating systems die! What do you do then? Your data is caged and you have two main choices:

Without any of these procedures you will in time lose the ability to access your own data. It will have become Dead Data!

Here is a brief aside:

Whose Data Is It Anyway?

The short answer is "Ours." The files we create - whether they contain words, numbers, images or sounds - are our own created copyright-protected intellectual property. Because of this it can be argued we have an implicit right of free access to the datafiles we have created. The 'data cage' of proprietary formats prevents this.

However - another factor is now in play.

The Emerging Solution

In the old days proprietary formats didn't seem like an issue. You made the decision to use WordPerfect or WordStar or whatever and once your letter was printed it could be read by anybody, kept in a filing cabinet, or what have you. Word-Processing had the advantage that you could keep electronic copies of letters and modify them for new purposes.

Problems came when people started distributing datafiles electronically. Printed documents or spreadsheets only need eyeballs, but when files are sent electronically you have to begin to worry about what software package someone else is using and whether they would be able to import your data. (And the continuing software upgrades didn't help either - how many of us have been sent a document as an email attachment which required a newer version of the application program than the one we've got to read it?)
(Automatically assuming everyone has got Word doesn't help either - you need to know which version of Word they've got!)

In this new age when so much information is transferred electronically, old proprietary data formats are no longer acceptable. The solution has come along with the invention of the Internet in the form of open standards.

Open and documented standards are the *only* way to prevent dead data. (When the bars of the data cage are clearly visible it is easy to figure out how to rescue the data inside.)

I'll give you an example. The two common image formats for World Wide Webpages are GIFs and JPEGs. Imagine someone in 2030 AD comes across a JPEG image somewhere in cyberspace. Even in the highly unlikely event that their machine doesn't understand JPEG images, there is such a stack of publicly available information describing the technical details of this image format that it would take little effort for a technically-minded person in 2030 AD to program an application to view JPEG files.

(Contrast this with a person in 2030 AD who comes across a Word 2.0 file. Unless Microsoft (or one of their successor companies) has released the technical details of the Word 2.0 file format into the public domain - our techno-savvy 2030 AD person is going to be rather frustrated! They'll be able to grab the text (being ASCII) but will have to make educated guesses about the document formatting, fonts, etc.)

There are many open standards in the Internet age. (We even have an open-source operating system - Linux!) There are open standards for images as above, as well as text (ASCII and HTML in a sense, etc.), audiofiles (MPEG Layer 3. RealAudio is proprietary and is losing popularity I believe), videofiles (MPEG; QuickTime, despite being invented by Apple, has virtually become an open standard; I'm not sure of the status of AVI files). [Open standards also exist outside of the Internet - the POSTSCRIPT page description language for printers is an example.]
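The contrast between an open and a closed format can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real file viewer: the function names `sniff_format` and `extract_ascii_runs` are hypothetical, and the sample bytes are made up. The first function recognises JPEG and GIF files from their publicly documented opening bytes; the second does what our 2030 AD person would be reduced to with an undocumented format, namely pulling out runs of printable ASCII and guessing at the rest.

```python
import re
from typing import List

# Opening bytes taken from the published format documentation:
# JPEG streams begin with FF D8 FF; GIF files begin with the
# ASCII text "GIF87a" or "GIF89a".
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


def sniff_format(data: bytes) -> str:
    """Name the format if the data starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def extract_ascii_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Salvage runs of printable ASCII from an undocumented binary file.

    Anything that is not printable ASCII (formatting codes, font
    tables and so on) is simply skipped, so the layout is lost.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


if __name__ == "__main__":
    # Made-up stand-in for a file in a secret word-processor format.
    mystery = b"\x00\x01Dear Sir,\x00\x02\x03ab\x00Yours faithfully\xff"
    print(sniff_format(mystery))        # unknown
    print(extract_ascii_runs(mystery))  # ['Dear Sir,', 'Yours faithfully']
```

The words come back, but everything that made the file a formatted document is thrown away, which is the "educated guesses" situation described above.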
It is my belief that in time all the proprietary file formats will die away. (Those that stay longest will be those like PDF files which are supported on a wide range of computer platforms.) The existing situation where open and closed standards co-exist on the Web will not last. [As HTML and its mother the Standard Generalized Markup Language (SGML) and daughter the eXtensible Markup Language (XML) grow and develop new capabilities, and as Internet bandwidth expands - plug-in programs (which are Windows-specific) will also die as their current capabilities are first met and then exceeded.]

I read somewhere that Microsoft Office 2000 will save documents in the open-source XML format. (It'll be interesting to see if it is pure XML, or XML with hidden Microsoft extensions!)

In short - I am convinced proprietary data formats are a 20th-century phenomenon, and they will disappear in the 21st century. All our documents, our spreadsheets, our audiovisual stuff, will then be created using publicly available open-source data formats. Dead data may be solely a 20th-century phenomenon as well.

The only real question is how long this will take. (I am waiting for an open-source document format with the power and flexibility of the latest version of Word files to appear. The raw templates - SGML and XML - already exist. If Linux is a fair example of the process of open-source distributed creation - this document format may be only a few years away.)

But what about all the old stuff?

What About All The Old Stuff?

Old data is important data. Clifford Stoll in his book "Silicon Snake Oil" (4) in chapter 11 remarks how the Internet is a good reference source for contemporary information, but is surprisingly shallow when it comes to historical stuff. (This is slowly changing with museum collections becoming available online - but there is always a gap. Historical information can be as recent as technical information on a 1995-model printer.)
The Web generally has a shallow memory - the public library is still a better resource for historical information in context!

How can this change? The old data must be converted to new formats - photographs and sound recordings digitised, old letters digitised or retyped, old data formats converted to newer formats. All this takes a lot of time, money, and human resources. Will it be done? I'd say only as a sideline to another activity (e.g. running a museum), as a volunteer activity (e.g. the Local Historical Society), or because there is potential profit to be made in repackaging or re-selling the information.

[I have grave concerns about how Bill Gates has been paying for the rights to digital forms of images that really belong in the public domain. I can envision a time where you'd have to pay a token fee to download a GIF of the Mona Lisa. I'm unsure about the legal precedents for such things. I'll have to research it further before I can write about the legal, copyright and public interest implications of all this.]

The hassles with proprietary data formats have already been discussed. This 'retrospective conversion' is never a high priority for an institution - new stuff is given priority. The data conversion from old storage formats will not be done in time, and I reason that, as with The Last Film Search (for old nitrate film stock before it spontaneously combusts), the data will decay before it has had a chance to be converted. A portion of our digital history will die as well.

Who will mourn the Dead Data?

References

(1). Review of a film "Into The Future: On the Preservation of Knowledge in the Electronic Age" in Scientific American. Vol.278. No.1. January 1998. p.88.
(2). Gleick, James. The Digital Attic - Are We Now Amnesiacs? Or Packrats? at:
(3). Sergeant, Steve. Your Own Data. Essay and feedback discussion at:
(4). Stoll, Clifford. Silicon Snake Oil : Second Thoughts on the Information Highway.

Related Works

See also The Addenda further down.
(June 1999) [Printed in "Reality Module No.10."]

All The Dead Data - Addenda

A (Archiving)

.A. Came across this article which touches on many of the issues I raised in my essay of two issues ago - and adds some more. (These are the edited highlights.)

Faye, Denis. Storing e-data for e-ternity. in "The Age" (I.T. Supplement). 4 May 1999. p.1&8.
PDF files - as you know - are a sort of compressed version of Postscript printfiles, those files which tell an expensive Laser Printer how to trace graphics on a page so as to produce an accurate printout of a relevant page. This is why the XML metadata is so essential here. The PDF file on its own does not contain any information which makes sense to a computer (it is information for Laser Printers and after that human eyeballs). In a way it is analogous to various projects to store digital photographs - the computer has no way of understanding what is in a photograph, and so human beings have to provide an attached file with machine-readable text, so that the computer has something to search through and index with.

.B. And now something that I hadn't thought about - Emulators!

"Emulators are pieces of software that allow a computer to behave like another one."

"Businesses are turning to emulators to access files and software that, while still useful, have become inaccessible with the march of time and technology."

I'll talk about Amigas for a bit - since that is what I know best. (Though the principles are also applicable to Intel machines and Apple Macs.) Most Amigas use the old Motorola 68000-series of Central Processing Units (CPUs) - like the old Apple Macintoshes.

There is a piece of software for the Amiga called PC Task. When you install this program on your Amiga and run it - it makes the 68000-series CPU perform operations in a manner analogous to an Intel 486 chip. This enables the Amiga to run operating systems and software that were written to run on an Intel machine. It will run MS-DOS, Windows 3.1, and any software (memory-permitting) which runs on top of those operating systems. On a 68060-powered Amiga, Windows 3.1 runs fast enough to be genuinely useful - though I wouldn't want to run it on my 68030 system. (When the much-awaited version of PC Task for PowerPC-equipped Amigas comes out - it should enable Windows 95 to run at an acceptable speed.) In short - PC Task is an emulation program which, without changing the Amiga's hardware in any way, allows it to behave as if it was a PC.

Another program - ShapeShifter - allows an Amiga to emulate an old pre-PowerPC Apple Macintosh. (This runs much faster than the PC emulator - since Amigas and older Macintoshes both use Motorola 68000-series CPUs.) I have seen the AmigaOS ('Intuition'), PC Task and ShapeShifter running at the same time on a single Amiga!

Of course, emulation programs exist for Intel machines and Apple Macintoshes. (There is an Amiga emulator for the PC called Amiga Forever. It requires a reasonably powerful 166MHz Pentium-class chip - as it is not only emulating the Motorola CPU but the four Amiga custom co-processors as well!)

I have been following the progress of emulation software for several years now. Software that started out as very flaky, crash-prone, and hard to set up - has become much more stable, faster and easier to install.
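For the technically minded, the core of an emulator can be sketched as a loop which fetches each instruction of the 'guest' machine, works out what it means, and carries it out using the host's own operations. What follows is a toy only: the two-register machine, its three opcodes (`LOAD`, `ADD`, `HALT`) and the function name `run` are all invented for illustration, and nothing here reflects how PC Task, ShapeShifter or Amiga Forever are actually written.

```python
from typing import List

# Opcodes of a hypothetical guest machine (invented for this sketch).
LOAD = 0x01   # LOAD reg, value : put an 8-bit value into a register
ADD = 0x02    # ADD dst, src    : dst = dst + src, wrapping at 8 bits
HALT = 0xFF   # HALT            : stop and hand back the registers


def run(program: bytes) -> List[int]:
    """Interpret a guest program and return the final register values."""
    regs = [0, 0]   # the guest's two 8-bit registers
    pc = 0          # program counter: index of the next instruction
    while True:
        opcode = program[pc]                      # fetch
        if opcode == LOAD:                        # decode and execute
            reg, value = program[pc + 1], program[pc + 2]
            regs[reg] = value
            pc += 3
        elif opcode == ADD:
            dst, src = program[pc + 1], program[pc + 2]
            regs[dst] = (regs[dst] + regs[src]) & 0xFF
            pc += 3
        elif opcode == HALT:
            return regs
        else:
            raise ValueError("unknown opcode 0x%02X at %d" % (opcode, pc))


if __name__ == "__main__":
    guest_program = bytes([LOAD, 0, 40, LOAD, 1, 2, ADD, 0, 1, HALT])
    print(run(guest_program))   # [42, 2]
```

A real emulator has to do this for hundreds of instructions, plus memory, video and disk hardware, which is one reason the early ones were so slow and the hardware requirements so steep.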
[You can even emulate computer platforms which don't exist as physical hardware devices. The most famous is the Java Virtual Machine (JVM) - a software-only computer architecture which can be emulated on a wide range of computer hardware platforms - and was designed so that exactly the same programs could run on all of them!]

My conclusion - the software will only get better! It will improve until it operates transparently and you mightn't even notice you're using it (!) - e.g. your Windows 2010 machine might be called upon to open an 'ancient' file - and it will create an emulated Windows 3.11 machine somewhere in its protected memory, and load that file into an old version of Excel automatically. We won't need to keep a progressive museum of PCs after all - our grand new PC can emulate all the dead generations at will.

.C. The good thing about all this technological obsolescence is - you can pick up a better class of secondhand goods!

Copyright © 1999 by Michael F. Green. All rights reserved. Last Updated: 6 January 2005