Preserving Electronic Data - An Active Archival Process

Harrison Eiteljorg, Archaeological Data Archive Project

In the meeting program was a quotation from Thomas Carlyle, who said, "All that mankind has done, thought, gained or been: it is lying in magic preservation in the pages of books." We are accustomed to acting on this notion, so ably expressed by Carlyle, by preserving those pages of books. Of course, we preserve them, whether in archives or in museums, so that people can have access to them, so that people can gain from them, as Carlyle said, all that mankind has done.

But, as we all know, Carlyle's statement is no longer correct. Much of what we know today is not to be found in the pages of books but in the silicon and magnetic memories of computers. More and more of what we learn is not stored on paper or in directly accessible form - though it may incidentally be put on paper in part or in whole - but in computers, and the computers and their programs, as well as the data files, are required if we are to access the information. So how do we archive computerized information? In my case, since I direct the Archaeological Data Archive Project, the answers are basic to my work, but archiving information - not objects, but information - is required of virtually all of us here in this room. So I hope that my work may be of interest and value to you.

Computer records consist of magnetic or electronic codes on a substrate. Archival storage of them would seem to be simple. Keep the magnetic or electronic codes and the substrate from deteriorating. But we have two obvious problems. First, we know from experience that we can't prevent the magnetic or electronic codes from decaying. Therefore, we know that, at the least, we must regularly copy our computer files in order to preserve them. (Of course, this is one of the great advantages of computerized information. Exact copies can be made easily.) The new copy, not the original, becomes the archival one, because it has a longer future. Second, and far more perplexing, we also know that, sooner or later, the computers and/or the software used to make the records will become obsolete and go out of use. However, as I mentioned before, the computers and programs are required to access the data. Therefore, when the computers or software can't be found, even properly stored and preserved records will be inaccessible. Whether we preserve the original files through some black magic or make copies, the computer files will eventually become useless.

I am not saying that it is impossible to store computerized information, but it clearly is impossible to preserve computer files - the physical records of magnetic or electronic markers - for a lengthy period of time. They will decay, even if slowly. It must also be considered impossible to preserve computer files for access by simply copying them to keep them in good shape (called data refresh). We can make good, exact copies so long as the hardware required remains current; at some point, though, the hardware or software required for using the files will no longer be available, and then the copies will be inaccessible. So we cannot preserve the original physical item, and making copies of the physical item cannot preserve accessibility.

If there is a way to store computer files in an archive, then, it must be quite different. We cannot preserve the original files or simply copy them to preserve the information and access thereto. What, then, can we do? First, we must focus on our aim - preserving the information and access to it, not the physical objects (the computer files).

With that in mind, let's return to some of the problems of archiving computerized information - and some of the solutions. So far, we've discussed two problems - decay of the physical computer files and obsolescence of the files by virtue of changes in computing technology. Ironically, the second problem makes the first one a non-issue. That is, we can now store data on media that will survive a very long time, perhaps as long as a century. The media will have become obsolete long before the data have decayed. CD-ROMs are our longest-lived media now (a century has been claimed), but the CD is already becoming obsolete. The specifications for the replacement technology, the DVD, include a requirement of backward compatibility, but what will be the next storage medium? The CD, after all, is only a few years old. Of course, even usable CDs will have files that are inaccessible within a couple of years because of changing software.

The problem of obsolescence is then the serious technical problem. Fortunately, the solution is not so difficult as one might expect - at least in theory. For those of us who have worked with computers for a fairly long time, the solution is a disciplined application of processes we've already used. We have moved from machine to machine, perhaps CP/M to DOS to Windows or UNIX to Windows or PCs to Macs. Some of us have been lucky enough to have tried several of these moves and, consequently, to have tested our tolerance for frustration at new and dangerous levels. When we have made these moves, of course, we have wanted to take along old data; so we have found ways - some good ones and some terrible ones - to translate the data files from one format to another. For instance, we may have taken images from a Mac and put them on a UNIX Web server, requiring us to change the file format, or we may have taken a data table from a PC and put it into a Mac-compatible form.

We may also need to translate data files when we upgrade a program to a new version that requires a new file format, or we may change software brands and need a new format. For instance, Microsoft changed the file formats for Word, Excel, and Access in the latest release of those products as part of Office 97, and those formats are different from the formats used by WordPerfect, Lotus 1-2-3, and Paradox. Autodesk changes the DWG format it uses with almost every new version of AutoCAD, and that format is different from the one used by Microstation.

When we move files to new formats, the process is called migrating the data. It's often not a difficult process at all. In fact, various software packages make this translation process automatic for specific formats. In other cases, however, it is very demanding. For instance, it's totally automatic to transfer a Word file from the previous format to the latest one. But it can be quite another matter to transfer a database table from FileMaker Pro to Access or to move a CAD model from the proprietary format used by one manufacturer to that used by another.

Allow me to give you a couple of examples. Suppose we are keeping tabs on something complex like an excavation. Batches of artifacts are taken out of the ground in groups, and each group contains items that shared a specific excavation context. When we design a database system to handle the artifacts, we want to be able to access them in those excavation groups. However, we can't know how many artifacts each group will contain, because, in fact, each group may contain any number of artifacts. There are two very different ways to deal with this uncertainty, depending upon the type of database software one is using. I don't want to bore you with the details here, but the point is that there is no way to make a simple transfer between the two disparate database systems. Instead of a direct translation, there must be a series of processes to extract information from the donor system, reformat it to suit the recipient, and then insert it into the recipient system. This process requires both thorough understanding of the databases in question and familiarity with the data themselves. In some cases, in fact, a true specialist's understanding of the data may be required in order to migrate the data.
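
To make the contrast concrete, here is a minimal sketch, in Python with SQLite, of the sort of design that handles groups of arbitrary size: the artifacts live in their own table and point back to their excavation group. The table and field names are my own illustration, not those of any actual excavation database.

```python
import sqlite3

# Illustrative one-to-many design: an excavation group (context) can hold any
# number of artifacts, so the artifacts occupy their own table and point back
# to their group rather than filling repeating fields in a single record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE excavation_group (
    group_id  INTEGER PRIMARY KEY,
    context   TEXT,   -- e.g. trench and locus
    excavated TEXT    -- date the group was excavated
);
CREATE TABLE artifact (
    artifact_id INTEGER PRIMARY KEY,
    group_id    INTEGER REFERENCES excavation_group(group_id),
    material    TEXT,
    description TEXT
);
""")

conn.execute("INSERT INTO excavation_group VALUES (1, 'Trench A, locus 12', '1997-06-03')")
conn.executemany(
    "INSERT INTO artifact (group_id, material, description) VALUES (?, ?, ?)",
    [(1, "ceramic", "rim sherd"), (1, "bronze", "fibula fragment")],
)

# Retrieve every artifact together with its excavation group.
for row in conn.execute(
    "SELECT g.context, a.material, a.description "
    "FROM artifact a JOIN excavation_group g ON a.group_id = g.group_id"
):
    print(row)
```

A system built instead around, say, repeating artifact fields within a single group record would have to be unpacked field by field before its contents could be loaded into tables like these - exactly the kind of reworking that demands familiarity with the data.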

One more example. If I were to migrate my CAD model of the entrance to the Athenian Acropolis prior to the grand structure of the Periclean period from its current form (an AutoCAD file) to the native format for Microstation (the other major brand of CAD software), I would have a serious problem. I have divided my model into more than 200 different data segments - by building material, phase, condition, etc. But Microstation permits only 63 such segments. An automatic translation cannot be performed, and it would require considerable familiarity with both the programs and the building in question to migrate the data without losing some of the information content.
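
To give a flavor of what such a migration involves, here is a hypothetical sketch (my own, not an actual procedure) of its core step: a merge table that maps the many source segments onto the smaller number of levels the target format allows. Writing that table is the hard part, because it encodes decisions that only someone familiar with both programs and with the building itself can make.

```python
# Hypothetical sketch of the core of such a migration.  The source model
# distinguishes segments by material, phase, condition, and so on; the target
# format allows far fewer levels, so categories must be merged.  The merge
# rules are the hard part: they record decisions that require knowledge of
# both programs and of the building.
source_layers = [
    "limestone_phase1_intact",
    "limestone_phase1_restored",
    "marble_phase2_intact",
    # ... more than 200 segments in the real model
]

# Expert-defined merge rules (invented names, not the actual scheme).
merge_rules = {
    "limestone_phase1_intact": "phase1_limestone",
    "limestone_phase1_restored": "phase1_limestone",
    "marble_phase2_intact": "phase2_marble",
}

MAX_TARGET_LEVELS = 63  # the limit imposed by the target format

target_layers = sorted({merge_rules[name] for name in source_layers})
if len(target_layers) > MAX_TARGET_LEVELS:
    raise ValueError("the merge rules still produce too many levels for the target format")

print(f"{len(source_layers)} source segments -> {len(target_layers)} target levels")
```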

Given these comments, one might wonder if migrating data is really possible. I certainly think so, since my work centers on two archival projects. We must use the available software and do what we've done all along - make sure the data are moved in incremental steps so that files are not left behind. How often, after all, have we had to discard data? Generally speaking, that has only happened when we've left the data files to gather dust for a rather long time. When those of us who have been at this for a while wanted to bring along our old files, we generally could find a way.

Fortunately, software producers have a real and significant need to provide migration paths for their file formats in order to keep faith with their customers. Furthermore, if one sticks with reasonably common file formats, there are usually translation programs available in the commercial marketplace.

There are ways to keep the problems to a minimum in an archive. For instance, we can limit the types of files we will accept, requiring contributors to supply data in one of a number of widely used formats (perhaps in addition to the original format). In most cases, it is not difficult to output data to one of those formats; so this is not a significant burden for contributors.
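
As an illustration only - the accepted extensions below are invented, not our actual policy - a check of this kind is simple to automate at intake:

```python
# Sketch of an intake check (the accepted extensions are invented, not the
# archive's actual policy): a contribution is flagged if any file is not in
# one of the widely used formats the archive has agreed to support.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".rtf", ".sgm", ".dxf", ".tif"}

def files_needing_reexport(filenames):
    return [name for name in filenames
            if not any(name.lower().endswith(ext) for ext in ACCEPTED_EXTENSIONS)]

problems = files_needing_reexport(["catalog.csv", "plan.dxf", "notes.xyz"])
if problems:
    print("Ask the contributor to re-export these files:", problems)
```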

We can also encourage good practices by our contributors in the first instance. For example, I mentioned earlier the use of different database design approaches. We encourage archaeologists to use only the standard, theoretically correct approach. That makes our job easier, but it also makes it easier for a scholar to share his/her data with anyone else, not just the archive. (As an aside, I might point out that one of the most sophisticated archaeological data sets of which I am aware used a non-standard database management system in its initial version. The developer ultimately decided to move the data into a more standard form. The problems have been very difficult, but they have been overcome. And the new form will permit future migration with ease.)

We also require contributors to supply appropriate information about their data when they contribute. That information will be required for users - and for us when we must migrate the data. We don't want to have to figure out how the data files were structured without guidance, especially since we are unlikely to need to migrate the data until after the original scholar has joined the supply of material waiting to be excavated.
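
Something as simple as the following record - the field names are merely illustrative - captures the essentials: what the file contains, how it is structured, what software produced it, and what each field means.

```python
# A minimal sketch of the documentation we ask contributors to supply with
# their files (names and fields invented for illustration).  The aim is to
# record the structure and meaning of the data while the contributor can
# still answer questions about them.
contribution_metadata = {
    "title": "Pottery catalogue, Trench A",
    "contributor": "J. Scholar",
    "date_contributed": "1998-04-02",
    "file_format": "comma-delimited text",
    "software_used": "FileMaker Pro 3",
    "fields": {
        "context": "excavation locus from which the sherds came",
        "material": "fabric classification",
        "count": "number of sherds in the lot",
    },
    "notes": "Counts exclude residual material; see field notebook 7.",
}
print(contribution_metadata["fields"]["context"])
```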

I want to emphasize the importance of standard file formats. Whether we are talking about formal standards or industry standards, using them nearly always reduces future problems.

Once the data files are in the archive, of course, we will, sooner or later, need to migrate them. To govern that process, we maintain a database of our data files. We keep track of each file's format and date of contribution, and we will migrate files according to a schedule determined by developments in the hardware and software industries.
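
A sketch of what that tracking amounts to - the entries and format names here are invented for illustration - might look like this:

```python
# Illustrative sketch of the archive's tracking records (entries invented):
# for each file we record the format and the date of contribution, and we can
# pull the list of files held in a format now scheduled for migration.
from datetime import date

holdings = [
    {"file": "propylaea_model.dwg", "format": "AutoCAD R12 DWG",
     "contributed": date(1996, 5, 10)},
    {"file": "pottery_catalog.csv", "format": "comma-delimited text",
     "contributed": date(1997, 11, 2)},
]

# Formats flagged for migration because the industry has moved on.
formats_due_for_migration = {"AutoCAD R12 DWG"}

to_migrate = [h["file"] for h in holdings if h["format"] in formats_due_for_migration]
print("Files due for migration:", to_migrate)
```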

Since we store the files on CDs (two - one on site and one in a bank vault), we do not expect to need to worry about refresh. As I indicated earlier, the formats will go before the files decay; should I be wrong about that, of course, we will also refresh the data by copying to new media.
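
Should a refresh ever be needed, it is easy to script. The sketch below (paths, names, and checksum choice are my own assumptions) makes the essential point: the new copy is verified as an exact duplicate before it takes over as the archival version.

```python
# Minimal sketch of a data refresh (paths are placeholders): copy a file to
# new media and verify that the copy is bit-for-bit identical before the new
# copy, not the original, becomes the archival one.
import hashlib
import shutil
from pathlib import Path

def refresh(src: Path, dest_dir: Path) -> Path:
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    # Compare checksums so we know the copy is exact.
    if hashlib.sha256(src.read_bytes()).hexdigest() != \
       hashlib.sha256(dest.read_bytes()).hexdigest():
        raise IOError(f"copy of {src} does not match the original")
    return dest

# e.g. refresh(Path("propylaea_model.dwg"), Path("/mnt/new_cd"))
```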

I believe the technical problems of data refresh and migration are manageable - not trivial, to be sure, but manageable. So far, we have had no serious difficulties, but we do have a couple of interesting cases.

In the case of our CAD model of the Acropolis entrance, for instance, we have kept the file in the format for Release 12 of AutoCAD, though the current Release is now 14. We did that because Release 13 was not well received, and Release 12 files could be used by both versions of the software. Therefore, maintaining the file in R12 format seemed a good idea. We do have a Release 13 version of the file as well, but it has been used internally only.

Since Release 14 has now come out, we will need to make a choice again. The hidden problem, though, is how many such formats can we keep? That may turn out to be as difficult a question as any, though it is a practical one rather than a theoretical one.

I mentioned Microsoft Word a minute ago. In the case of word processing documents, we have the advantage that there are standards that are not tied to particular programs. Text files can be stored in the Rich Text Format (RTF) or in Standard Generalized Markup Language format (SGML). In both cases, the changes in software should leave the files unaffected for some time.

We have had some problems with data sets as well. One contributor gave us spreadsheet files that should have been database files. Another gave us text files to make into Web tables and, from there, database tables. In both instances we had to do some reworking of the data in order to make good data tables. As the archive gets busier, we won't be able to take the time to do these things, but we are learning what to watch out for.

If the technical problems can be overcome, there still remain many practical problems to be faced.

One such question that arises quickly is that of vetting the contributions. We plan to establish review panels for various subject areas, but, I must confess, we've not yet had so many offerings that we've chosen to limit the contributions. It will be necessary to find some way to vet the contributions, though, just as we would review a manuscript. Otherwise, the archive is of questionable reliability.

Another important question involves the nature of the archive. Are we talking about a physical or a virtual archive? That is, do we mean to propose a single location for the archive, a place where all the files will be stored, or a virtual archive consisting of many physically separate archival operations seen as a single source through the magic of the Internet? The answer, of course, is that the archive will be a distributed, virtual one. There are already multiple archives, and there are already cooperative approaches. For instance, we are now working with the Archaeology Data Service in York, England.

All those who build archives need to work together to avoid duplication of storage - and duplication of migration efforts. There will be considerable amounts of sweat and tears (one hopes not blood) spent determining how to migrate data, and the efforts there should not have to be duplicated.

There are other areas that you may wish to ask about (I hope there will be time for questions), but the last I want to discuss today is the most difficult - getting scholars to contribute their data.

Some scholars worry about the security of their data in a quasi-public archive. They worry that their data will be misused in some way to justify or support arguments they think invalid, and they see the potential for someone taking the data, altering it, publishing incorrect analyses, and claiming that the data support the analyses. In fact, though, having an archive provides proof against such misuse. There will always be a real and verified copy of the original data files available. Anyone can obtain a copy of the original, test hypotheses on the original, and verify results. Without an archive, only files received from the original scholar can be counted upon. With the archive, accurate copies are always available.

Many scholars assume that their institutions - usually universities or museums, of course - will be responsible for archiving electronic files as they have been for archiving paper records. That should be the case, in an ideal world. However, the problems of data migration - and the economic problem of maintaining rarely accessed data in perpetuity - have not been faced by many research institutions. So, while many scholars are implicitly relying on their institutions to archive their data, such reliance is ill-advised.

There is a hidden reason why some scholars will not contribute their data to an archive. They are unwilling to permit others to see the raw data for fear they will lose control of the information and its interpretation. This we can do little about, alas, but I think the ethical requirements of the field are changing here.

Finally, there is a problem for us with the perception of electronic publication. Unfortunately, some scholars equate electronic publishing and archiving. There seems to be something about putting materials on the Web that people think is equivalent to putting them into an archive. This is a small problem, but it is clearly something to note. After all, the migration problems are no different with materials on the Web, and there is a host of other difficulties in addition.

Despite the problems I have enumerated, I want to conclude with some optimism. There are good reasons to expect more and more archiving of archaeological data to happen in the not-too-distant future. Professional organizations are beginning to push for archiving computer records specifically. Governments are beginning to talk about this issue as well. More important, scholars who have been early adopters of the technology are now finishing the projects on which they did pioneering computer work. They are the ones who understand the importance and value of the computer records. They will again be in the vanguard as they contribute data to an archive.

Finally, I want to remind you that data are as precious and irreplaceable as works of art. We must preserve our records, just as we must preserve works of art, electronic or otherwise. As I hope I've shown, this is not really technically demanding work. We all know that it is often unappreciated, but, sooner or later, its importance and value will be clear. It's one of our jobs to make it sooner.