Electronic Media Group

Website History and Preservation

May 2001

Websites date back to 1991, when CERN (the European Laboratory for Particle Physics) prototyped the WWW protocol. CERN has a long history with the Internet: it had worked directly on the web's development since 1989, and on the protocols leading to the WWW since the mid-1970s.

Packet switching (the basic method of moving data across a network) dates to the mid-1960s. Packet switching made the Internet possible: ARPANET was created in 1969 for use by the military and a handful of major universities.

The first website outside CERN has been attributed to Louise Addis, who organized the team that developed it in December 1991 at Stanford University (Stanford Report). The site provided access to SLAC (Stanford Linear Accelerator Center) bibliographic material. Tim Berners-Lee (CERN), the credited developer of the web, called her accomplishment the "killer application" that brought the web to the world.

In 1993, the web had 600 websites. By 1994 there were 10,000 websites; by 1995, 100,000; in 1997 the million-website mark was passed; by 1998 there were 2.7 million sites; by 1999, almost 5 million; and in 2000, 7.2 million (Pandia).

The massive growth inflection point is 1993, which corresponds to the release of Mosaic, the first graphical web browser. Internet history can be found in the InternetValley archives and at CERN.

Most websites have lives ranging from a month to 2-6 years. No website can be older than 10 years (when the web began, fewer than 100 sites were online), and most cannot trace their existence prior to 1995, the beginning of the web's explosive growth. Do we have those initial sites archived? Probably not. The Internet Archive project, founded by Brewster Kahle (http://www.archive.org/), only goes back to 1996. The Internet Archive performs a full archive of the Internet with Alexa Internet robots every 2 months, at off hours, respecting bandwidth and users' prime-time activity. The Internet Archive presently holds 42 TB of information on disk arrays; tape has proven unreliable.

The look, feel and functionality of a website are determined by several factors: (1) content, (2) the WWW itself, (3) the hypertext transfer protocol (HTTP), (4) HTML or another markup language, (5) the web browser and (6) the host server software, which serves the files, online, to the user. Simply put, the Internet, host server and marked-up content work together to assemble the site on a user's computer.
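
To make that division of labor concrete, here is a minimal Python sketch of the request/response cycle described above; "example.com" is a placeholder host, and any web server would do:

    # Minimal sketch of the request/response cycle described above.
    # "example.com" is a placeholder host; any HTTP server would do.
    import http.client

    conn = http.client.HTTPConnection("example.com")  # the Internet connection (2)
    conn.request("GET", "/index.html")                # an HTTP request (3) for content (1)
    response = conn.getresponse()

    # The host server software (6) answers with headers and a body.
    print(response.status, response.reason)           # e.g. "200 OK"
    print(response.getheader("Content-Type"))         # e.g. "text/html"

    html = response.read().decode("utf-8", errors="replace")
    print(html[:200])  # the HTML markup (4) that a browser (5) renders for the user
    conn.close()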

Websites are cultural entities. Librarians, curators and archivists collect websites to preserve them and to be able to use them beyond their online existence (a possible copyright violation). If a website's content, structure and appearance are not saved and documented, they will be lost. While not all websites require preservation, there are specific, culturally significant sites which we need to learn to save effectively.

Websites are (1) archived (on a 2-month cycle), (2) cached and (3) backed up while they exist online. An archived website will not function as the original website did without the server software, its specific configuration, and plugins (such as Flash and Shockwave) and their updates. A cached website saves content that has been accessed by a user, but not all parts of a site are commonly accessed, so not all of a site's appearance is saved in a cache; caching also does not save a site's functionality. Searching caches will turn up some lost information, but it cannot be relied upon for preservation. Saving a website from outside the host server, as in archiving or caching operations, is not a preservation effort: pieces can be preserved, but not the full functionality. The pieces can be used in website archaeology, should that be required to reconstruct parts of the site.

A backup of the site's content from the server's hard drive saves the site in a quasi-preservation manner. It does not, however, save the server software or its configuration. True preservation must save content, its hierarchy, the server configuration, plugins, updates and details about the site's creation process. Conservators need a deeper understanding of the configuration and creation details required for preservation.
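
The limits of outside-in capture can be illustrated with a short sketch. The following Python fragment, in which the URLs are placeholders, saves only what the server chooses to deliver (the HTML and the files it links to); the server software, its configuration and any server-side logic remain invisible to such a crawler:

    # Sketch of an outside-in capture: it collects only the delivered
    # pieces (HTML and linked files), never the server software or its
    # configuration. "example.com" is a placeholder site.
    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect href/src attribute values from a delivered page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    base = "http://example.com/"
    html = urllib.request.urlopen(base).read().decode("utf-8", errors="replace")

    collector = LinkCollector()
    collector.feed(html)

    # Everything an outside crawler can ever capture is this list of
    # delivered pieces; dynamic server-side behavior is not among them.
    for link in collector.links:
        print(urljoin(base, link))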

A feature of today's Internet culture is the speed with which events come into existence and vanish. When content is no longer wanted, needed or required, it is taken offline, because space and bandwidth are economic factors. If the site's content was not captured by a server backup, one of the Internet's caching systems, an anonymous proxy server, a conscious preservation effort or the site's authors, it will pass from existence.

When a website goes out of existence, the webmaster performs one or more functions: (1) the host server is configured (a very simple change) so it will not serve the site, (2) the relevant DNS (domain name server) is informed that the IP address is dead or (3) the server software is shut down. Once inactive, the website's files are removed from the precious space they occupy on the server's hard drive. Commonly, the files are copied to another storage medium, such as a CD-ROM (or they are simply deleted). When this occurs, the website enters the realm of storage.
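
Each of these three shutdown modes fails at a different layer when probed from outside. The following Python sketch, with "example.com" standing in for the departed site's hostname, distinguishes the cases:

    # Hedged sketch: probing how a departed website "looks" from outside.
    # "example.com" is a placeholder hostname.
    import socket
    import urllib.request
    import urllib.error

    host = "example.com"

    try:
        ip = socket.gethostbyname(host)      # a dead DNS entry (2) fails here
        print("DNS still resolves to", ip)
    except socket.gaierror:
        print("DNS entry is gone")
    else:
        try:
            urllib.request.urlopen("http://%s/" % host, timeout=10)
            print("Server is still serving the site")
        except urllib.error.HTTPError as err:
            print("Server answers but refuses the site:", err.code)  # case (1)
        except (urllib.error.URLError, OSError):
            print("Server software is down or unreachable")          # case (3)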

If an author or webmaster wished to remove all traces of a website from existence, considerable work would be involved. If the site were more than 2 months old, they would have to ask the Internet Archive to remove it from its collections. Known, anonymous and secret website caches around the world would have to be purged. Because the transatlantic [telephone] cable provides such slow Internet access, most US websites used in the UK (and Europe) are cached on a series of servers in the UK to speed access time. Universities routinely cache the web so they can save information for their researchers and students that may otherwise pass out of existence. Large businesses run proxy servers that cache new web information requested by staff and then serve up the cached version when the internal request is made again. Searching for these copies is called "archaeology" in the Internet world. How Internet archaeology is done is still developing, in response to the problem of websites going offline and other needs.
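
One small archaeological probe can be sketched directly: HTTP response headers such as Via and Age, where present, hint that a copy was served by an intermediary cache rather than by the original host. The URL below is a placeholder; a real search would target known caches and proxies:

    # Hedged sketch of a cache probe. Via/Age headers, where present,
    # suggest an intermediary cache held this copy; Last-Modified hints
    # at how old it is. The URL is a placeholder.
    import urllib.request

    response = urllib.request.urlopen("http://example.com/old-page.html")

    for header in ("Via", "Age", "Date", "Last-Modified"):
        value = response.getheader(header)
        if value:
            print("%s: %s" % (header, value))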

As with a chair, painting or sculpture, storage is critical to the preservation of a website. Unlike artifact storage, however, if the storage medium fails, the content is lost rather than merely damaged. But file storage is not the whole story: server software, its configuration, a site's documentation and many other factors are important components of website preservation.

Tim Vitale, Chair, tjvitale@ix.netcom.com

With help from Walter Henry, Lead Analyst, Preservation Department, Stanford University Libraries & CoOL Webmaster and John Burke, Chief Conservator at the Oakland Museum of California Art.
