Why I chose the World Wide Web as a repository for archival material
Simon Pockley (+61 3 94897905
The markup in this document has been altered to meet the markup and style conventions used in CoOL.
NB. An updated version of this document, renamed Killing the Duck to Keep the Quack is available at http://www.duckdigital.net/FOD/FOD0055.html
There is a very real possibility that nothing created, stored and disseminated electronically will survive in the long term. The problem does need to be stated this dramatically. I have an unfailing sinking feeling whenever anybody links the concepts of digitisation and preservation. I have a profound and unchanging belief that these two concepts do not belong in any sense in the same world.
Maggie Exon: Long-Term Management Issues in the Preservation of Electronic Information [1]
The digital era has been characterised by technological obsolescence and ephemeral standards, ironically threatening the usefulness of digital information. There is little firm ground upon which to build the institutional and private structures necessary for the effective preservation of this material. Nowhere are the challenges more difficult than those concerning the new networked medium of the World Wide Web. The vitality and flexibility of this medium mean that digital material is in a state of constant proliferation and mutation. It is the thesis of this essay that rather than being a difficulty, these mutable qualities should be seen as providing an archival advantage. Like a rapidly mutating virus, a proliferating networked medium is capable of carrying digital information not only across platforms and standards but also from old to new technologies. That much of the content of this information no longer resembles the material we currently classify as information is the result of a paradigm shift towards a new form of cognitive representation.
After my father's death in 1990 I found some negatives in a rusty cigarette tin. They were an unusual size so it was not easy to have prints made. The photographs [2] were of central Australian Aborigines, taken during a camel expedition in 1933. For me, these images are invested with meaning; they trigger childhood memories of stories I heard about the journey. Now, they may well be the only existing record of these people. They carry with them a responsibility to send this information into the future.
The richness and acuity of history is often anchored and lent credibility by such innocuous pieces of information or trivia. They are seldom regarded as important in their time and survive by chance. The story of a civilisation can hang on a fragment of inventory, or evidence from the life of an extraordinary person, on a detail in a photograph.
In 1977 NASA manufactured an archival artifact after deciding to put a message aboard the Voyager [3] spacecraft. This was as much a communication through time as it was through space. It will be 40 thousand years before they (Voyager 1 & 2) come within a light year of a star and millions of years before either reaches any other planetary system. Using the technology of the day, NASA produced a 12-inch gold-plated copper phonograph record encased in a protective aluminium jacket. The record (encoded in analog form) contained sounds and images representing life on Earth. It was a remarkable and prescient exercise which serves here to focus on two important archival issues: first, the difficulties in attempting to accommodate technological change; second, the importance of selection.
The Voyager missions were on the threshold of the digital era. Since then, the move to digitisation has been driven as much by a wish to disseminate as to send information through time. Another of the ironies of this era is that just as we have evolved [4] the capacity to record and disseminate almost every facet of an entire human life (or even the era itself), we have become so overwhelmed by the sheer quantity of data that we find ourselves asking, "What can we delete?"
Selection has always been a key word amongst the custodians of our records in libraries, archives and museums. When significant collections of material in digital form can be dispersed in multiple locations, re-collecting takes on a new meaning with new vulnerabilities. The currency of the term Cultural Memory has more to do with an anguish over our collective ability to forget than with any confidence in the reliability of recollection.
In the past the custodians of cultural records had to continually balance the preservation of manuscripts, books, recordings, film and video against their availability for public access. In spite of Marshall McLuhan's alliterative pronouncement ('the medium is the massage'), the process of digitisation reminds many of us that information can usually be separated from the medium which carries it. This loosening of the bonds imposed by the medium allows data in almost any form (text, sound, image) to be reused and recombined with a facility which we have yet to learn how to exploit.
With electronically-stored information, the paradigm shift from concern about durability to concern about permanence has been completed. We may worry about hackers but we do not worry about genuine use. In fact, we revel, we positively boast when we can show an exponential growth in the use of the information services we provide. It requires a large shift in perception to realise that the best chance electronic information has of being preserved is that it should go on being used, regularly and continually. As soon as it is not used, it is in trouble.
Maggie Exon: Long-Term Management Issues in the Preservation of Electronic Information [5]
In the digital environment the links between selection of materials, provision of access to those materials, and preservation of them over time is so inextricably linked that at the National Library we tend to talk increasingly simply of providing short and long term access rather than even making a semantic distinction between preservation and access.
Maggie Jones: Preservation Roles and Responsibilities of Collecting Institutions in the Digital Age [6]
A digital world is also a fragmented world where we no longer trust what we see or hear. This is not the place to examine the many and various manifestations of this anxiety. But it is a world where loss of memory is endemic, where the near past has been almost obliterated by an accelerating present and where place has lost meaning. The rush to digital technology has been so sudden that we have barely had time to work out how we can practically apply it to records of the past, let alone consider the questions necessary to ensure the durability of the present. Nor have we had time to reflect on the stability and durability of the means of mediation: computers and software.
While loss of data associated with the deterioration of storage media [7] (CD-ROM, DAT tape etc.) is important, the main issue is that both hardware and software become rapidly obsolete. Who today has a punched card reader or a working copy of FORTRAN II? Information from the first Landsat satellite is now irretrievable because no working machine can read the tapes used for storage in the 1970s. Every time a word or image processor vendor introduces an improved update with a slightly incompatible file format, more information becomes inaccessible. For example, Windows 95 will not run many of the existing DOS and Windows programs and hence information archived using DOS or Windows is lost to Windows 95 unless new versions of the software programs are released. An interesting phenomenon has been the abnegation of archival responsibility by the individual to the technology itself ('the computer crashed'). Those of us who use computers can provide numerous examples of data loss, often for reasons shrugged off as inexplicable.
Devices and processes used to record, store and retrieve digital information now have a life cycle of between 2 and 5 years, far less than some of the most fragile physical materials they are seeking to preserve. The practice known as 'refreshing' digital information by copying it onto new media is particularly vulnerable to problems of 'backward compatibility' and 'interoperability'. Visions of computers capable of emulating obsolete hardware and software have no basis in economic reality.
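The mechanical part of 'refreshing' can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sha256_of` and `refresh` and the file names are hypothetical, and temporary files stand in for the old and new media. The sketch confirms that the copy is bit-identical to the original. It does nothing about whether any future software will still understand the format, which is the harder problem described above.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy source to target and confirm the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise OSError(f"copy of {source} does not match the original")
    return after


# Throwaway files stand in for the old and the new storage medium.
with tempfile.TemporaryDirectory() as workdir:
    old_medium = Path(workdir) / "old_medium.dat"
    new_medium = Path(workdir) / "new_medium.dat"
    old_medium.write_bytes(b"The Flight of Ducks")
    checksum = refresh(old_medium, new_medium)
    copy_matches = new_medium.read_bytes() == old_medium.read_bytes()

print(checksum, copy_matches)
```

A matching checksum shows only that the bits survived the move. The meaning of those bits still depends on the encoding and software used to read them.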
An even greater vulnerability is that the custodians of the information, be they private or institutional, are unlikely to be able to bear the costs and complexities of moving digital information into the future. They will deliberately or inadvertently, through a simple failure to act, render the information irretrievable. Most institutions have yet to develop a body of knowledge and experience, or infrastructure, to deal with these issues. Usually the preservation sections have the preservation management skills but not the technical skills or equipment. Even after 30 years they have yet to properly define the true nature of this problem.
The greatest fear the report raises is that in a world when more and more cost-justification is required, the owners of information will not take the steps needed to keep it available; nor will they permit others to do so; and much will disappear.
Michael Lesk: Preserving Digital Objects: Recurrent Needs and Challenges [8]
Some of the institutions embarking on the collection and provision of access to digital information have begun to ask difficult questions about the archival issues of providing long term storage and access for digital material. The National Library of Australia recognises that this is more than just a technical problem and has developed a flexible approach which can evolve with the technology.
Whether the creator is an organisation with both the commitment and resources to maintain digital information over time themselves or not is likely to determine the nature of the relationship between them and the Library. For example if the creator is not in a position to maintain digital information over time but the information is considered to be significant, the National library may well undertake the responsibility for maintaining it if another institution is not considered to be a more appropriate site. These national mechanisms have yet to be worked out - other institutions take on roles as both facilitators and active participants in preserving digital information.
Maggie Jones: Preservation Roles and Responsibilities of Collecting Institutions in the Digital Age [9]
In December 1994, the Commission on Preservation and Access and the Research Libraries Group created the Task Force on Archiving of Digital Information [10] to investigate the means of ensuring "continued access indefinitely into the future of records stored in digital electronic form." Their draft report is essential reading and is the source of most policy in this area. The Task Force divides digital information into two distinct categories or kinds of objects (a term derived from programming):
Institutional research and thought has concentrated on the preservation of the first group (document-like objects) primarily because these are, to date, the bulk of all archival holdings. Today, relevant and vibrant culture is more likely to reside in the second group. Either way, our ability to retain both groups of digital objects depends on a number of inherent properties which give both of these objects their versatility.
At a binary level they can be represented by 0s and 1s. The length of the binary sequence varies according to the operating system. The meaning of each sequence varies according to the kind of information it describes (text letter, number, image etc.). An important function of a code is to represent characters in a standardised way. It must be able to translate human communication into something that digital computers can understand, and it must be standardised so that computers can communicate with each other without a loss of data integrity.
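A small Python sketch makes the point concrete. The four byte values below are chosen purely for illustration. The same sequence of bits can be printed as raw binary, read as text under the ASCII code, or read as a single whole number, and nothing in the bits themselves says which reading is intended.

```python
# Four bytes chosen for illustration; any four would do.
raw = bytes([0x44, 0x75, 0x63, 0x6B])

# Reading 1: the bare binary sequence.
as_bits = " ".join(f"{byte:08b}" for byte in raw)

# Reading 2: the same bits interpreted as ASCII characters.
as_text = raw.decode("ascii")

# Reading 3: the same bits interpreted as one unsigned integer.
as_number = int.from_bytes(raw, byteorder="big")

print(as_bits)
print(as_text)
print(as_number)
```

Without an agreed code, a reader of the bits has no way to choose between these interpretations.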
For the archivist the basic challenge is to be able to migrate the structure and content of information through a maze of competing digital coding systems. The source of this maze can be found in the history and development of character encoding standards.
Character coding really began in the 19th century with the first attempts to automate the typesetting process. It was the telegraph that led to the development of remote typesetting and subsequently the growth of newspaper chains. The first standard (the Baudot code) used only five bits, too few to represent letters, figures and punctuation in a single set, so several shift keys were used to increase the available codes. Later, this served as a model for the option, alt, control and command keys on today's computers. A variant of Baudot's code, the Teletypesetter (TTS) code, was introduced by Walter Morey. However, the problem with both these shift codes was that, if a shift was lost in transmission, all subsequent codes were misrepresented until another shift code was received. This problem was not solved until the 1960s.
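The fragility of shift codes can be shown with a toy example in Python. The two tables and the code numbers below are invented for illustration and are not the real Baudot or TTS assignments. The sketch only reproduces the mechanism: the same code number means a letter or a figure depending on the last shift received.

```python
# Hypothetical tables: the same code number has two meanings.
LETTERS = {1: "A", 2: "B", 3: "C", 4: "D"}
FIGURES = {1: "1", 2: "2", 3: "3", 4: "4"}
SHIFT_TO_FIGURES = 30  # invented control code
SHIFT_TO_LETTERS = 31  # invented control code


def decode(codes):
    """Decode a stream of codes, switching tables on each shift code."""
    table = LETTERS
    output = []
    for code in codes:
        if code == SHIFT_TO_FIGURES:
            table = FIGURES
        elif code == SHIFT_TO_LETTERS:
            table = LETTERS
        else:
            output.append(table[code])
    return "".join(output)


message = [1, 2, SHIFT_TO_FIGURES, 1, 2, SHIFT_TO_LETTERS, 3, 4]
intact = decode(message)

# Simulate losing the shift-to-letters code in transmission.
damaged = decode(message[:5] + message[6:])

print(intact)
print(damaged)
```

With the shift lost, every code after it is read from the wrong table, and the receiver has no way of noticing the error.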
There are two ways to fill a technological gap such as that which existed for character codes. First a group of companies can get together, spend months or years in careful consideration and bring forth a standard. The other way is for one company to create its own solution, quickly implement it, and expect all the other companies to follow along. Both of these paths were taken, producing two competing codes.
Leibson, cited by Jenny Sanders: The History and Function of ASCII Code [11]
ASCII (American Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code) were the result of these approaches; each is used by a different type of computer. One of the main difficulties in translating one into the other is that each system uses a different type of keyboard with keys on one that simply do not exist on the other. This is further complicated by the use of different versions of ASCII. Similarly, there are a number of differences between the translation of ASCII on a Mac and on a PC. At this stage ASCII is the most widely deployed representation of text, and in the interest of interoperability, information exchange on the Internet relies on it almost exclusively. However, the Internet reaches communities all over the world. If it is to become a significant cultural force, the needs of languages using non-ASCII character sets will eventually have to be addressed. At the moment the development of a new character coding system by Xerox (with even more codeable characters) known as Unicode appears to have the support of many of the major computer players. It is not difficult to see similar processes at work in almost any area of computing.
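The gap between the two codes can be seen by encoding the same word both ways. The Python sketch below assumes the `cp037` codec, which is one EBCDIC code page among several, so the exact byte values would differ on a machine using another variant. Reading the EBCDIC bytes with the wrong table produces nonsense, while migration means decoding with the old table and re-encoding with the new one.

```python
text = "Duck"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumption: one EBCDIC variant of many

print(ascii_bytes.hex(" "))
print(ebcdic_bytes.hex(" "))

# Reading EBCDIC bytes with the wrong table gives unrelated characters.
misread = ebcdic_bytes.decode("latin-1")

# Reading them with the right table recovers the text.
recovered = ebcdic_bytes.decode("cp037")

# Migration: decode with the old code, then encode with the new one.
migrated = ebcdic_bytes.decode("cp037").encode("ascii")

print(repr(misread), recovered, migrated)
```

The migration step works here only because every character in the word exists in both codes. A character with no counterpart in the target code would raise an error or have to be substituted.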
Just as the development of the Elizabethan printing press led to variations in text which sparked 400 years of Shakespearian textual scholarship, so the evolution of character coding may well provide the digital archaeologists with years of employment.
Electronic texts have been used in textual scholarship for nearly 50 years. It is only recently that they have begun to be used in libraries where they are expected to be reusable and accessible. Markup, a kind of metatext, can provide formatting information, designate hypertext links and assist in search and retrieval. Standard Generalised Markup Language (SGML) became an international standard in 1986. It has the advantage that it is not only independent of encoded text but also of hardware or software and can be transmitted across all networks.
Behind every screen on the World Wide Web lies ASCII text marked up with Hypertext Markup Language (HTML). HTML is a simpler but universally recognised version of SGML. Its simplicity allows people with almost no computer skills to create, format and disseminate digital material in a networked medium. HTML is limited in its expressive ability because it does not include many aspects of SGML document description. As a result, it is rapidly being extended. At this stage the market dominance of Netscape has made it likely that the Netscape extensions will become standard. Much of the world's information is now managed as SGML. In order to increase the contextual abilities of the web as a manager of this information, the direction of evolution is to make web user agents able to receive and process generic SGML in the way that they are now able to receive and process HTML.
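Because the markup is itself plain text, the content and the links behind a screen can be separated with very little machinery. The following Python sketch uses the standard library's `html.parser` module. The class name `ScreenReader`, the sample page and the link target `journey.html` are hypothetical and are not taken from any real site.

```python
from html.parser import HTMLParser


class ScreenReader(HTMLParser):
    """Collect the visible text and the hypertext links of a page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor that has an href attribute.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep any run of text that is not just whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


# A made-up page, marked up in simple HTML.
page = (
    "<html><body><h1>The Flight of Ducks</h1>"
    '<p>A journal of a <a href="journey.html">camel expedition</a> in 1933.</p>'
    "</body></html>"
)

reader = ScreenReader()
reader.feed(page)
reader.close()

text = " ".join(reader.text_parts)
print(text)
print(reader.links)
```

The text and the link survive intact once the tags are stripped away, which is one reason simply marked-up text travels so easily between systems.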
The Multipurpose Internet Mail Extensions (MIME) introduces Internet Media Types, including text representations other than ASCII. HTML is a proposed Internet Media Type as well as an SGML application. In the MIME and SGML specifications, however, character representation is notoriously complex, and the two specifications are inconsistent and incompatible. The Internet Engineering Task Force (IETF), and the MIME_SGML, HTML and Hypertext Transfer Protocol (HTTP) working groups are attempting to rectify these inconsistencies and are discussing the best ways of incorporating text representations other than ASCII.
From this point text rapidly becomes overwhelmed by the acronyms representing various and competing specifications, formats and markup variants. Such is the speed of development that it becomes difficult for even the most active participants in these races to keep up. It is equally hard to predict what, if any, standards will evolve. There is no reason to doubt that the same leapfrogging processes which applied to character encoding will apply to markup languages. HTML is also capable of carrying specific programs such as Java Applets which serve to automate many interactive tasks. The dynamic qualities of these programs have profound implications as they transform the World Wide Web into a new medium.
As we plunge headlong into a networked mediation of knowledge, the very nature of digital information has expanded beyond the notion of stable encoded objects. To properly understand how this contributes to a new medium requires a paradigm shift. It is made more difficult by not yet having the semantics with which we can describe this medium. A prime example is the misleading and persistently redundant metaphor of the 'home page'. The term 'screen', as a reference to the visual entity, is surely better, not only because it is more descriptive of what we see but also because it refers to what is hidden (the markup and encoding behind the screen).
First, it is necessary to abandon ideas about the finished work and to redefine concepts of the published work. Above all, this is a proliferating medium of rapid, if not instant, global dissemination. For the archivist, the notion that we can save a copy of every work published is as absurd as it is to think that each screen might be anything more than an evolving variant of a continuous stream of material. The screen might only exist in virtual form during the time of access. Outside this moment of access, the screen consists only of a set of rules and references to fragments from other sources from which this screen is to be derived. It is inherently unstable not because it is unable to carry these fragments but because the fragments themselves may be continuously changing. The screens resulting from the actions of search engines are prime examples. Add to this live feeds of video and sound (not necessarily from the same places) and we see that we are, in fact, generating a new representation of conversation and thought. One might usefully swap the word idea for screen in order to get closer to a description of the dynamic processes involved.
Much of the information is networked and has no tangible form. There is no obvious link to whose responsibility it is to maintain it, no obvious way of being able to tell whether it is in fact endangered, and no easy way to find it in the first place, much less make an assessment of its significance.
Maggie Jones: Preservation Roles and Responsibilities of Collecting Institutions in the Digital Age [12]
The impact of metadata on this same material by networked providers, archivists, collectors, researchers, commentators and special interest groups is just beginning to be felt.
Metadata springs from the self-referential quality of networked media; it is, literally, data about data. Metadata might even outweigh the information itself. Metadata might include navigational aids, discovery aids, access counts, guest lists, data bases, combing screens, emails, chat group references, even essays like this, which itself contains metadata. Most of this referential information, even archival information, is similarly live and in a state of constant change as, like the material it describes, it evolves beneath the weight of comment, input and upgrade. The extent to which this metadata is capable of being separated from the information it describes, influences or extends is a difficult question. It is a question made even more complex when the context of the information is evolving with the medium itself.
However, in order to avoid infinite regress and ask questions about the usefulness of metadata as an aid to finding information, the participants of the OCLC/NCSA Metadata Workshop [13] recast this question into: how can a simple metadata record be defined that sufficiently describes a wide range of electronic objects?
It should be pointed out that in both form and content metadata mirrors the role it is perceived to play. So far it is primarily text and resembles the metadata that describes traditional printed texts.
without appropriate access mechanisms, preservation of digital objects is a thorough waste of everyone's time, money and expertise. Because scanning millions of documents -- even if those documents are the fundamental documents of a nation -- is a useless enterprise unless we're also figuring out how to create interpretive structures that a researcher can navigate unassisted.
Michael Lesk: Preserving Digital Objects: Recurrent Needs and Challenges [14]
Since networked media contains more information than professional abstractors, indexers and cataloguers can manage using existing methods and systems, it was agreed that a way to obtain usable metadata for electronic resources is to give authors and information providers a means to describe the resources themselves, without having to undergo the extensive training required to create records conforming to established standards. As one step toward realizing this goal, the major task of the Metadata Workshop was to identify and define a simple set of elements for describing networked electronic resources. To make this task manageable, it was limited in two ways. First, it was believed that resource discovery is the most pressing need that metadata can satisfy, and one that would have to be satisfied regardless of the subject matter or internal complexity of the object. Therefore, only those elements necessary for finding the resource were considered. The second was to provide mechanisms for extending the core element set to describe items other than document-like objects.
Since the difficult work of identifying a simple but useful specification for the description of networked resources has only just begun, the major accomplishment of the Metadata Workshop was to define the problem and sketch out a solution.
Some of the most useful metadata is produced by persons other than librarians and document owners, and it can be found neither in card catalogues nor in self-descriptions of the documents themselves. Many kinds of such "third-party" metadata (e.g., bibliographies) are indispensable aids to information discovery. It should be possible to allow topic-oriented metadata documents with semantic network functionality to be cooperatively authored, interchanged, and integrated into master documents. Such documents (and amalgamated master documents) might resemble traditional catalogues, indexes, thesauri, encyclopediae, bibliographies, etc., with functional enhancements such as the hiding of references that are outside the scope of the researcher's interest, etc.
Stuart Weibel: OCLC/NCSA Metadata Workshop [15]
The workshop proposed a set of thirteen metadata elements, named the Dublin Metadata Core Element Set (or Dublin Core, for short). The Dublin Core was proposed as the minimum number of metadata elements required to facilitate the discovery of document-like objects in a networked environment such as the Internet. The syntax was deliberately left unspecified as an implementation detail. The semantics of these elements was intended to be clear enough to be understood by a wide range of users.
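Since the syntax was left open, any concrete record is only one possible rendering. The Python sketch below is a minimal illustration. The thirteen element names are given as I recall them from the workshop report and should be checked against it. The `describe` function is hypothetical, and the values for Form and Language are illustrative assumptions, not a published catalogue entry.

```python
# Element names as recalled from the 1995 workshop report (an assumption).
DUBLIN_CORE_1995 = (
    "Subject", "Title", "Author", "Publisher", "OtherAgent", "Date",
    "ObjectType", "Form", "Identifier", "Relation", "Source",
    "Language", "Coverage",
)


def describe(**elements):
    """Build a record, rejecting any element outside the core set."""
    unknown = set(elements) - set(DUBLIN_CORE_1995)
    if unknown:
        raise ValueError(f"not core elements: {sorted(unknown)}")
    # Keep the elements in the order of the core set; all are optional here.
    return {
        name: elements[name]
        for name in DUBLIN_CORE_1995
        if name in elements
    }


# Illustrative values only.
record = describe(
    Title="The Flight of Ducks",
    Author="Simon Pockley",
    Form="HTML",
    Language="English",
)

for name, value in record.items():
    print(f"{name}: {value}")
```

A record this small can be written by an author with no cataloguing training, which was the workshop's stated aim. It also shows how little room such a description leaves for context or variants.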
There are problems to solve concerning the methods by which different data types and standards can be accommodated by an extended core. The basic difficulty with this description is that it takes the concept of document-like objects too literally and treats the mutating screen as if it were material in print. As a description of a discrete (and printable) screen there is little allowance for context, for variants or for the mutating elements which give each screen its dynamic qualities. The problem is essentially semantic because textual description is difficult without agreed terminology. In the past, when faced with this problem, we would quickly invent new words, discourse would continue and a typology would evolve. However, the World Wide Web has blurred distinctions between image, speech and writing to the extent that it is extremely difficult to separate concise discourse about the Web as a cybernetic system from the system itself.
One need only enter one of the graphical on-line chat sites such as The Palace [16] to see how speech (and emotion) has become entangled with text and image. The consequences for semiotics are profound, particularly where writing and images are also reorganised graphically through hypertext links. It also becomes impossible to define the participant as either reader or writer. Each participant reads an individual path by actively selecting a series of links which, in turn, transform that participant into the writer of a kind of dendritic trace of ideas. In this way hypertextual reading, writing and thinking mirror the cognitive process itself. What is new is the representation of this process, not just because this path can be traced back but because it can be recorded.
The Net navigator, or cybernaut, has learned to find her way around in the rhizomatic flood of hypertext links. She knows that there's no original text, no 'actual' document to which all other documents are to be related. She's figured out that on the Net it's primarily a matter of forming small machines, creative text designs and sensible images out of the manifold and dispersed text segments. These machines, designs and images, which didn't exist previously in this way and won't continue to exist in the future, are ontologically transitional in type.
Mike Sandbothe: The Transversal Logic of the World Wide Web: A Philosophical Analysis [17]
In theory every screen on the web could be integrated into a single site or period of access. Or, each site or access could be seen as representing fragments of an evolving idea. Where does this leave the archivist? Is archiving possible?
The archival project, The Flight of Ducks (to which this essay is metadata), is in many ways a pilot project. It was, and continues to be, constructed publicly on the web and is made up of nearly 1000 screens, 600 of which are currently published. Apart from the fragile source material from which this site is derived, the site exists only in digital form. To my knowledge, its digital content is spread over a number of platforms and media: a Silicon Graphics Indy workstation, an Acorn RISC PC, a University mainframe, Zip Drive disks, variously formatted floppies and a Mac-readable CD. Interestingly, it has also been captured twice by a third party on CD. When I look at these captures the site is almost unrecognisable, such has been its evolution. It is proposed that the site will find a place at the Victorian State Film Centre's digital library - Cinemedia. The National Library of Australia has also expressed a need to capture it and it may fall into a proposed global capture by the University of RMIT. Very few, if any, of the issues I have raised in this essay have been addressed by these institutions. If digital material of significance is to be available for future generations it is very important that these issues are considered both by individuals and institutions.
Technological obsolescence is only a part of the problem in the preservation of digital information. The World Wide Web is a flexible carrier of digital material across both hardware and software. Its ability to disseminate this material globally, combined with its inherent flexibility, allows it to accommodate evolving standards of encoding and markup. Survival of significant material on-line is dependent on use, and use is related to ease of access. Inexperience and lack of infrastructure are the primary threats; both are inherently human problems. The first step towards finding appropriate answers to these problems is to find the right questions.
My greatest concern is that we may be trying to be too elaborate; the plethora of projects, organisations and competing systems is a worry. In particular, everything we are building now appears dependent on elaborate webs (world-wide or otherwise) of cooperation. Maybe the idea of localised, bounded collections has not quite yet had its day and the world's large collecting institutions should not so readily divest themselves of their selection and collection responsibilities in favour of vaguer notions of the coordination of devolved responsibilities.
Maggie Exon: Long-Term Management Issues in the Preservation of Electronic Information [18]
Created in December 1994, by the Commission on Preservation and Access and The Research Libraries Group. The purpose of the Task Force was to investigate the means of ensuring "continued access indefinitely into the future of records stored in digital electronic form." Composed of individuals drawn from industry, museums, archives and libraries, publishers, scholarly societies and government.
The Memory of the World programme was launched by UNESCO in 1992. Its aim was to safeguard the world documentary heritage. One of the very important parts of the programme has been, since its beginning, the digitisation of original documents.
I have a profound and unchanging disbelief that these two concepts belong in any sense in the same world.
Timestamp: Monday, 06-Jun-2011 10:17:51 PDT
Retrieved: Monday, 27-Jan-2020 03:48:07 GMT