A Hybrid Systems Approach to Preservation of Printed Materials

The Issues


What Are the Advantages and Disadvantages of Each Technology?

Micrographics

Advantages: As a storage medium, microfilm is durable and relatively inexpensive. Standards for creating, processing, storing, and reading microfilm are well known; the equipment necessary to read microfilm is not likely to become obsolete (all that is needed is light and magnification); microfilm copies are recognized as legally acceptable substitutes for original documents; microfilm can theoretically store high-quality greyscale images inexpensively; and it is a recognized archival medium (ANSI IT9.5-1988, ANSI PH1.67-1985) with a large installed equipment base. See Figure 1.

Disadvantages: Film can become scratched when handled; consequently, archival film is usually stored in a vault, and only copies are distributed for general use. Each generation or succeeding copy loses resolution (about ten percent). In addition, most micrographics reader/printers must access the film manually; reader/printer blowbacks (printouts) are of poor quality; film creation variables are difficult to control; film quality can only be determined after filming is complete; and bad pages must be re-filmed and spliced in.

In addition, there is no way to selectively tune the input process to maximize quality based on page content. Some preservation projects require filming two exposures of certain pages--a high-contrast exposure to effectively capture the text and a low-contrast exposure to capture photographs more faithfully. Even with this approach, certain color combinations don't photograph well, such as black print on a red or blue background. (Some preservation microfilmers have developed a special film-processing chemistry that improves the tonal range of greyscale images while preserving the contrast--in essence giving the user the best of both worlds-- greyscale and text). Finally, the practitioner must be aware that most of the microfilm produced by the typical service bureau for records management does not meet preservation standards.

Digital imaging

Advantages: The digital image format offers ease of access; excellent transmission and distribution capabilities; electronic restoration and enhancement; high-quality user copies; and automated retrieval aids. Notice that the primary focus is on improving user quality and providing better access to the information. See Figure 2.

Disadvantages: The technology is relatively new; a digital image, displayed or printed, is not yet acceptable as a legal substitute for the original; standards are lacking in many areas; digital storage is not considered archival-- it requires continuous monitoring and eventual or periodic rewrite; the drive systems will inevitably become obsolete; there are relatively high but rapidly declining storage costs; the cost to store high-resolution archival images increases as the quality increases; and greyscale images require even more storage space.

Summary

Micrographics: A mature technology, generally accepted for preservation of printed materials. High quality and low cost. Major weakness-- inadequate access and distribution characteristics.

Digital Imaging: Most promising future technology for preservation of printed materials. Rapidly evolving in quality, speed of access and economics. Major weaknesses-- the technology is fairly new, data storage requirements for archival quality images are high, standards are lacking, and it is not a proven archival storage medium.

The Optical Disc

The improving optical disc access solution: Access is the other side of the preservation coin. It is one thing to preserve a corpus of knowledge for future generations; it is another, and completely different objective, to provide researchers access to preserved materials in a way that will not damage them. In reflecting on this dichotomy, Bill Nugent, a visionary in the field of imaging and optical disc technology, says, "...[T]he dual objectives of the preservation of materials and providing ...public access to them are opposed to each other. Preservation generally means a strictly controlled physical environment, watchful custodial care, and limited public usage. High public usage generally means accelerated wear and deterioration. But page images preserved on digital optical disc or in a hybrid system can now meet both objectives without conflict, since no wear results from the low-power laser beam used to read the data from the disks." * Clearly, optical disc, used in a hybrid system in a hierarchical fashion, fulfills its access role quite effectively.

In addition, several benefits are simply not possible with any medium but electronic: researchers will no longer have to travel to the physical location of the collection; they gain an increased ability to access multiple collections simultaneously; they can accurately and quickly retrieve very selective information; and, finally, they can obtain high-quality copies of historic documents. Since this increased access capability adds value to the research process, it has the potential to allow the institution to self-fund some of the preservation costs through revenues generated from charging for this improved access to these archival collections.

High capacity "permanent" storage: The optical disc was one of the primary technologies that made digital imaging practical. Digital images require huge amounts of storage space. The optical disc promised high capacity, permanence, removability, and random access-- all at an inexpensive price. The advantages of the optical disc as a storage technology are listed in Figure 3. Since the optical disc is read by a laser beam, and since its metallic surface is encapsulated in plastic or glass, it has high resistance to wear during use.

There are several kinds and sizes of optical discs. The one usually discussed for preservation is the write-once-read-many (WORM) disc. It is written with a laser beam that burns holes into its metallic surface. Once data is written to the disc it cannot be erased. If an error is made and the data must be rewritten on the disc, it is rewritten in a new area, thus leaving an audit trail*.

Other types of optical discs include read-only memory (e.g., CD-ROM and the videodisc) and the newest member of the family: Erasable. The erasable optical disc is viewed primarily as a replacement for magnetic tape and magnetic disk. Since it can be erased and rewritten, it is not usually considered for archival storage purposes.

The CD-ROM and videodisc are primarily distribution media; however, they have the same characteristics for longevity, removability, and error correction as their write-once cousins and could be used in an overall hierarchy of storage for effective storage of preservation documents. This is particularly true with the introduction of the write-once CD-ROM, which, because of the low cost of the media and the fact that it can play in a standard CD-ROM drive, should be very attractive for use as a preservation access medium.

Optical discs: how long will they last? Bill Nugent defines optical disc longevity as follows:

"Longevity is the expected duration between the time of manufacture of an optical disc and the time one of its important parameters degrades to a point where the disc becomes unsuitable for use or to a measurable point pre-defined as "end-of-life" for that parameter. An example would be a disc's bit error rate (BER)* degrading to 1.0 X 10E-04, a defined end-of-life point for 5.25 inch write-once optical disks."[2]

He says that by conducting a series of accelerated aging tests, one can statistically determine an expected end-of-life for an optical disc based on the increase in the bit error rate. Once determined, the bit error rate of each disc can be monitored to predict approaching end-of-life and allow the disc to be copied while its integrity is still guaranteed. Since optical discs contain two levels of error correction, discs in the early stages of degradation can be recopied with no loss of data.
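
The monitoring strategy described above can be sketched as a simple threshold check. This is an illustration only: the end-of-life value of 1.0 X 10E-04 comes from the definition quoted above, while the function name, the disc labels, the sample readings, and the safety margin that flags a disc for recopying at one tenth of the end-of-life value are all assumptions made for the example.

```python
# Hypothetical sketch of BER-based disc monitoring; not a vendor API.
END_OF_LIFE_BER = 1.0e-4   # end-of-life point quoted above for 5.25 inch write-once discs
RECOPY_FRACTION = 0.1      # assumed safety margin: recopy at 10% of end-of-life


def disc_status(measured_ber, end_of_life=END_OF_LIFE_BER,
                recopy_fraction=RECOPY_FRACTION):
    """Classify a disc from its measured bit error rate."""
    if measured_ber >= end_of_life:
        return "end-of-life"
    if measured_ber >= end_of_life * recopy_fraction:
        return "recopy"
    return "ok"


# Made-up readings for three discs.
readings = {"disc-001": 2.0e-7, "disc-002": 3.5e-5, "disc-003": 1.2e-4}
for label, ber in sorted(readings.items()):
    print(label, disc_status(ber))
```

In practice the recopy margin would be set from the accelerated aging data, so that a disc is copied while its error correction can still recover every bit.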

Longevity is critical in preservation applications. Optical disks will not be comfortably accepted (for archival storage) until longevity, decay rates, the physical nature of failure mechanisms, and a strategy for rewrite based on scheduled monitoring using prescribed test procedures (or scheduled rewrite procedures) have been established.[3]

Redefining "archival": When one thinks of defining archival, the definition "preservation of a document for about 500 years" comes to mind. This definition works well for information that can be interpreted by the eye, because the eye has remained the same for hundreds of thousands of years. However, technology advances rapidly. The information stored in electronic format must be interpreted through computers or computer peripherals for it to be intelligible by humans; however, two factors influence the ability to gain access to this information: the permanence of the media and the life of the technology needed to provide access to the information. The fact that digital storage media may last for 100 years or more has little meaning in and of itself. In this case, "archival" should be redefined as the ability to recreate an exact copy from the original medium before it degrades or the technology to read it becomes obsolete.

Impact of obsolescence on the digital approach: The National Archives, in its report "Preservation of Historical Records," claims that optical discs can never be used for permanent (I believe they mean archival) storage. The Archives is concerned about the problem of obsolescence. They cite as an example the 1960 census, which was the first to be automated. In 1970 archivists discovered there were only two computers in the world that could read the 1960 census data. One was in the Smithsonian, the other in Japan. We supposedly know less about this first "automated" census than we do about the census of 1860, 100 years prior.[4]

Obsolescence is a key concern for the designer of any digital image system. The fact that the storage device will become obsolete will require that the media be recopied every five to ten years.

Preservation through rewrite: The practitioner can monitor the media as suggested by Nugent, or adopt a policy of scheduled rewrite. There are those who feel that whichever strategy is employed, rewriting the prior generation of digital storage media onto the next generation will be cost effective because of advances in technology. However, by using the concept of the hybrid system and employing film as the system archive, the need for this rewrite (refresh) cost could be reduced or completely eliminated from the lifecycle of the system. After all, film, as a storage medium, is still less expensive than optical disc, and even though the archival film needs to be stored in a vault, these storage costs will remain less than the digital media refresh costs for some time to come.

Assuming the concept of the storage hierarchy is applied within the context of the hybrid system, only a small percentage of the preserved documents (the most frequently and most recently used) will be in digital format at any given point in time. This could substantially reduce the preservation system operating costs.

A final very real concern with the need to effect preservation through rewrite is that in tough economic times refresh costs could be cut from the budget, or for whatever reason, a policy of selective rewrite, or censoring, could be adopted. Can we really rely on those who will follow us to assume the recopying responsibility?

Resolution, the Key Design Element

Micrographics

Film resolution: Film resolution is typically defined as the ability to render visible fine detail of an object; a measure of sharpness, it is expressed as the number of line-pairs per millimeter (lppm)* that can be "resolved". A line-pair is one black and one white line juxtaposed. A series of line-pairs is said to be resolved if all lines in an array of line-pairs on a test target can be reliably identified. Film resolution is measured by photographing several test targets, and under a microscope, determining the smallest pattern on which the individual lines can be clearly distinguished.[5] See Figure 4. Research Libraries Group specifications require that a resolution target be part of the initial sequence of frames for each book on a film reel, and that the measured resolution be about 120 lppm, or a ten target.[6]

Effective film resolution: Theoretically, microfilm is capable of storing resolutions of 1,000 lppm, but this theoretical limit is actually never achieved because even the best microfilm cameras operating under ideal conditions are limited to about 200 lppm. And, due to variations in lighting, exposure control, lens quality, focus, development chemistry, camera adjustment, vibration, and other variables in a production environment, high-quality 35mm 12X film is usually imaged at an effective resolution of about 120-150 lppm (The RLG standard identifies any resolution above 120 lppm, at a 12X reduction, as being excellent). This effective film resolution equates to a digital binary scanning resolution of approximately 700-900 dpi. It will be a few years before cost-effective digital image systems capable of handling this level of resolution are available on a production basis. (See Appendix A)
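
The relationship between film resolution and scanning resolution can be illustrated with a short calculation. Appendix A remains the authority for the conversion; the sketch below only shows the arithmetic. The number of pixels allotted to each line-pair is an assumption: two pixels per line-pair is the bare sampling minimum, and a factor of about 2.8 (taken here as twice the square root of two) is assumed because it lands in the 700-900 dpi range quoted above.

```python
import math

MM_PER_INCH = 25.4


def film_lppm_to_dpi(film_lppm, reduction, pixels_per_line_pair):
    """Convert film resolution (line-pairs/mm on film) to dpi on the original page."""
    page_lp_per_mm = film_lppm / reduction            # line-pairs per mm on the original
    page_lp_per_inch = page_lp_per_mm * MM_PER_INCH   # line-pairs per inch on the original
    return page_lp_per_inch * pixels_per_line_pair


ASSUMED_FACTOR = 2 * math.sqrt(2)   # assumption: about 2.8 pixels per line-pair

for lppm in (120, 150):
    minimum = film_lppm_to_dpi(lppm, 12, 2)
    practical = film_lppm_to_dpi(lppm, 12, ASSUMED_FACTOR)
    print(f"{lppm} lppm at 12X: {minimum:.0f} dpi minimum, "
          f"{practical:.0f} dpi with the assumed factor")
```

With two pixels per line-pair the same film resolutions correspond to roughly 500-650 dpi, so the choice of sampling factor matters as much as the film resolution itself.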

Film is resolution-indifferent: A single frame of film can store an image at the maximum possible resolution for the film/camera combination being used. Film does not exact a premium for maximizing resolution. On the other hand, the cost of storing high-resolution digital images on any medium except film rises as the resolution increases; because a page is sampled in two directions, the number of data points grows with the square of the linear resolution. With higher resolution, more data points are required to accurately preserve the fidelity of the image, and more data points demand more memory for storage. Film, by contrast, is resolution-indifferent.

Film integrity: Archivists are comfortable preserving materials on microfilm, because they know that--assuming the film is manufactured, processed, and stored according to established standards--they are creating a permanent record that will possibly last hundreds of years.

Digital imaging

Background: Digital imaging technology is viewed by many as a replacement for microfilm; however, that perception is not completely accurate. It will be a few more years before optical disc will be a cost-effective storage medium replacement for film. In general, most people are familiar with micrographics. Conversely, many people are unfamiliar with the intricacies of digital imaging technology.

Digital image resolution: Digital image resolution is commonly defined as the number of electronic samples (dots or pixels) per linear unit of measure in the vertical and horizontal scanning directions. The term pixel is short for picture element. A digital image is analogous to an electronic photograph. It consists of a series of pixels that can be reassembled in the proper sequence to reconstruct the original page. These pixels are represented in computer memory by a digital code. Most commercially available image scanners range in resolution from 200 to 600 dpi and are referred to as bitonal or binary scanners because each pixel can only be represented as either black (0) or white (1). If the scanner captures greyscale pixels, then any continuous tones or halftones on the page will be captured more accurately. Greyscale pixels reflect the value of the light being reflected off the page and, for 8-bit pixels, are represented by a number on a scale from pure black (0) to pure white (255). The number (i.e., density) of dots is governed by the resolution of the digital image scanner. The higher the resolution, the higher the fidelity of this recreated representation.

Because these digital dots (pixels) are very small, a great many of them are required to recreate the image. For example, at a resolution of 300 dpi, 90,000 dots per square inch are generated. This is why large amounts of storage space are required to store high-quality image data.
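
The arithmetic behind these figures can be sketched as follows. The 8.5 X 11 inch page size is an assumption chosen for illustration, the sizes are uncompressed, and a megabyte is taken as 1,000,000 bytes; the function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8


print(300 * 300)  # dots per square inch at 300 dpi -> 90000

# Assumed 8.5 X 11 inch page, binary (1 bit) versus greyscale (8 bits) pixels.
for dpi in (200, 300, 600):
    binary_mb = uncompressed_bytes(8.5, 11, dpi, 1) / 1_000_000
    grey_mb = uncompressed_bytes(8.5, 11, dpi, 8) / 1_000_000
    print(f"{dpi} dpi: {binary_mb:.2f} MB binary, {grey_mb:.2f} MB 8-bit greyscale")
```

Doubling the resolution from 300 to 600 dpi quadruples the number of pixels, and moving from binary to 8-bit greyscale multiplies the data by a further factor of eight.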

For this paper we've defined various levels of resolution referred to as follows:

Digital imaging is not resolution-indifferent: As resolution increases so does the amount of data captured. The time required to scan and process the image, the quality and fidelity of the image, and the amount of storage space required to store it also increase as resolution increases. System resolution objectives must be examined in depth during systems design. Design trade-offs involving quality versus cost will influence every decision regarding resolution. For a detailed explanation of resolution issues, see Appendix A. It is important to determine exactly what the system's objective is so the system designer can determine the minimum economical resolution that completely satisfies the quality objectives. The idea is to maximize quality while minimizing cost.

The Trade-offs in Selecting One Technology Over the Other

A film-only system: The trade-offs involved in implementing an all-film preservation system at this time are: a) the film produced must be of the highest quality balancing high-contrast text with a wide range of greytones; and, b) typically, in film systems, very little attention is paid to indexing and creating automated retrieval capabilities; therefore, if the film is ever converted to digital, the access methods will have to be created at that time.

Designing a preservation system based on micrographics technology alone requires that all standards for the creation, handling, processing, and storage of the film be scrupulously followed. Also, it's important that the film created be of very high quality with a good balance of high and low-contrast content. However, indexing the film the way a typical digital collection would be indexed will most likely not be done. Of course, the individual publication or document can be identified along with the film roll or fiche on which it is contained, but it is extremely difficult to identify articles, pages, or the relationship between the two in a film-based system. Film indexing is just something not usually done because film access is usually sequential.

The choice is to live with the inefficient retrieval characteristics and low-quality blowbacks (printouts from a reader/printer) that are inherent disadvantages of film or to add digital retrieval at a later date. This can be done; however, the newly created digital page images will have to be further indexed to take full advantage of the digital image retrieval capabilities. This means a duplication of some of the document handling work done earlier when the film was first captured, but this incremental cost must be paid in order to enhance access.

A digital-image-only system: The trade-offs involved in implementing an all-digital preservation system at this time are: a) the designer might try to economize on the system by designing to a lower resolution, thus reducing implementation and operating costs at the expense of capturing a less-than-archival image; b) the operating budget may not include the cost of rewriting the optical disks; and c) all the quality and technical issues necessary to implement an archival digital image system have not yet been resolved.

A preservation system designed around only digital image technology must be configured to solve three major problems: 1) the lack of a true archival storage capability, 2) the need to scan at high resolution (around 600 dpi or higher with greyscale) to create an archival quality image, and 3) the high but declining cost of archival resolution image storage on optical disc. The fact that digital imaging is not resolution-indifferent means that the cost of image storage will be high. For example, storing an archival-quality page on optical disc using JPEG* requires approximately 2.25 megabytes (MB) of storage space (see Appendix A, "Greyscale scanners").

With the average 12-inch optical disc costing about $300 (in quantity) and having a storage capacity of about four gigabytes (a GB is 1,000 MB), 3,540 greyscale 9 X 5 inch images at a resolution of 600 dpi can be stored on one disc at a cost of about $0.085 per compressed page (media cost only). An image of the same resolution can be stored on film for less than $0.01 per page. In addition to the higher initial storage cost, the designer will have to figure in the cost of rewriting the disks every five to ten years. This rewriting cost may well be offset by the increase in storage capacity or decrease in technology cost over time.
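
The per-page media cost follows directly from the figures quoted above, as the short calculation below shows. Only the disc price, the page count per disc, and the upper bound on the film cost are taken from the text; the variable names are illustrative, and rewrite costs are deliberately left out.

```python
DISC_COST_DOLLARS = 300.0     # 12-inch optical disc, quantity price quoted above
PAGES_PER_DISC = 3540         # greyscale 9 X 5 inch pages at 600 dpi, as quoted above
FILM_COST_UPPER_BOUND = 0.01  # film is quoted as "less than $0.01 per page"

disc_cost_per_page = DISC_COST_DOLLARS / PAGES_PER_DISC
minimum_ratio = disc_cost_per_page / FILM_COST_UPPER_BOUND

print(f"optical disc media cost per page: ${disc_cost_per_page:.3f}")
print(f"optical disc costs at least {minimum_ratio:.1f} times as much as film per page")
```

Because the film figure is an upper bound, the ratio printed is a lower bound on the cost difference for media alone.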

Thomas Bourke, a well-known researcher in the area of applying micrographics and optical disc technology in libraries, in an article entitled "Research Libraries Reassess Document Preservation Technologies," notes that the Committee on Preservation of the National Archives and Records Administration made a recommendation to the Archivist that all holdings within the Archives be preserved on human-readable film, because this mature technology will not change significantly in the future.[7]

It seems that the Archives committee has concluded, as have many experts, that today an all-digital system is still a slightly risky preservation approach. But within the near future, technology will evolve; and the policy, standards, and administrative issues will be resolved, with one likely outcome being that the hybrid preservation system would become the accepted preservation approach.

The Benefits of a Hybrid-System Approach

Playing to their strengths: The requirements of a preservation system are best met with a combination of technologies. Digital imaging has two primary strengths: 1) The capability to improve access, transmission, and distribution of preserved images; and 2) The ability to electronically enhance (clean up) images. It eliminates some drawbacks that have kept micrographics from being a more widely accepted document storage and retrieval technology, instead of simply a space-saving technology.[8][9]

Micrographics, on the other hand, is currently the only truly archival preservation medium. It is excellent for providing long-term storage for massive amounts of infrequently used information. See Figure 5.

By taking advantage of the strengths of film combined in a hierarchical system with the access capabilities provided by digital imaging, a preservation system can be designed that will satisfy all known requirements in the most economical manner.

The hybrid end-user access system: In addition to the hybrid system designed to preserve the materials, there must also be hybrid systems that will allow access to the preserved collections. These systems could be both local and remote and will most likely be connected together via local or wide area networks. They should consist of file servers and end-user workstations.

The file servers provide access to both bibliographic catalogs that can be searched to determine where to locate items of interest and image databases containing images of the preserved documents.

The workstation (either a UNIX type system, or high-end PC 386 - 486) should be a key component in the design of any digital image preservation system. The design should focus on a distributed system based on the client/server model, where the workstations do the bulk of the work. The workstation should be used as the production engine or an end-user access station. If the system is designed in this manner then advances in workstation technology represent potential for tremendous operating efficiencies obtained by simply upgrading to the next generation of workstation processor. The benefit of doing this is that the systems designer can depend on the fact that the workstation will increase in power at the rate of about 25 percent per year, and the cost will decrease at the rate of 10 to 20 percent per year. Therefore, the price performance ratio of the entire preservation system gets better every year--automatically.
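
The compounding effect of these two trends can be sketched numerically. The sketch assumes that the quoted rates (about 25 percent more power per year, and a 10 to 20 percent lower cost per year) hold constant and compound annually, which is a simplification; the function name is hypothetical.

```python
def performance_per_dollar(years, power_growth=0.25, cost_decline=0.10):
    """Relative performance per dollar after `years`, starting from 1.0."""
    power = (1 + power_growth) ** years   # processing power relative to today
    cost = (1 - cost_decline) ** years    # purchase cost relative to today
    return power / cost


for years in (1, 3, 5):
    low = performance_per_dollar(years, cost_decline=0.10)
    high = performance_per_dollar(years, cost_decline=0.20)
    print(f"after {years} year(s): {low:.2f}x to {high:.2f}x performance per dollar")
```

Under these assumptions the improvement compounds, which is why a design that keeps most of the work on replaceable workstations benefits more each year than one tied to a fixed central processor.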

The production workstations would be connected to the preservation system via a local area network. They are used to perform the preservation functions such as batching, scanning, indexing, controlling the creation of digital film, etc.; all of the functions required to archive the documents.

On the other hand, the end-user access workstation will allow researchers to gain access to the databases of preserved documents. The system provides access to text, digital image, and multi-media databases distributed on CD-ROM, multi-media databases of images on videodisc, online networks (such as BRS, Dialog, and EPIC) as well as a document ordering capability, facsimile document delivery, and computer-assisted film retrieval. Access to one or more preservation databases--online or on CD-ROM--will allow the user to find citations to content of interest and request facsimile printouts on a local high-quality binary printer. See Figure 6. In this manner the end-user system can be useful regardless of the storage media or the technology used to preserve the materials. Where copies of documents will suffice, they can be delivered in fax format within hours of the request. Researchers will save considerable time and money by not having to travel to where the preserved materials are located, thus eliminating hardships for the researcher and artificial barriers to access.

Film First then Convert...or Vice Versa?

Filming first: Within the hybrid system concept, if an institution chooses to create film as the first step in the preservation process, the system designer can choose either low or high-contrast film based on the type of material being processed and optimize the chemistry accordingly. With film there is little flexibility for handling pages differently based on content unless multiple (low and high-contrast) exposures are used for each page, or unless, through some special processing and/or chemistry, the tonal range of the film can be extended. Typically, with low-contrast film some resolution and text clarity will be sacrificed. On the other hand, high-contrast film means better text rendering with fewer grey levels. Micrographics is basically a high-contrast process.

Many experts recommend filming first and then scanning the film. Their theory is that since the light shines through the film being scanned, most of the light can be captured by the CCD (charge-coupled device)* scanning array, and a better image created. In hardcopy scanning, the light reflects off the page in various directions and only some portion of it is captured by the CCD array. Although more light might be captured while scanning film, this advantage is offset by the fact that the film is already a generation away from the hard copy original and has lost some of its original resolution and greylevels. Therefore, image quality is probably about the same, regardless of whether the image is scanned from hardcopy or film.

Glen Magnell, director of marketing for the Document Imaging Systems Division of the Minolta Corporation, claims that microfilm is the most efficient input medium for recording onto optical disc. Magnell says that "...scanning from microfilm is much more efficient and virtually as reliable as hardcopy scanning [emphasis added]."[10]

I would disagree. Filming first works well if the documents require little processing in conjunction with the capture process. That's because film is a linear medium, so it can only be used by one person or process at one time. When filming, the hardcopy must be processed in the exact order in which it should appear on the film, and QC is only performed after the film is developed. The filming process requires a good deal of batching, rework, and splicing, which makes it quite inefficient.

On the other hand, when hardcopy is converted to digital form it is extremely easy to process the page (e.g., indexing, real-time QC (quality control), OCR, sorting, batching and parallel processing); all are inherent in the technology.

A second concern is the limited number of microfilm scanners available and their limited resolution options. Because the demand for preservation scanning from film is small, it may be necessary for the system to have a microfilm scanner custom-modified to meet the archival-resolution requirements of preservation scanning.

However, filming first, and creating digital images by selectively scanning the film, seems to be the least risky current preservation option provided that appropriate attention is paid to indexing the filmed collection.

Scanning first: If the choice is to create digital images as the first step in the preservation process the key decision revolves around the scanning resolution. Scanning original documents at a yet-to-be-determined "optimal archival" resolution means creating a balance that produces image quality comparable with photographic methods while minimizing the amount of data stored.

After scanning, image enhancement techniques are applied to improve image quality and the full high-resolution greyscale image is used to create high-quality film using an electron beam or digital computer output microfilm (COM) camera. The quality of the film created is governed by the scanning resolution and amount of greyscale data captured. (See Appendix A.)

At the same time, a parallel process uses the high-resolution greyscale image generated in the image enhancement process and converts it to a high-quality reduced resolution binary image suitable for information access. This very high-resolution image on film is the archival copy. The reduced resolution image in digital form can always be recreated from the film copy for only a few cents per page. Obsolescence is not a factor.

Timing and volume, two key factors: Of major concern when implementing a scan-first archival resolution preservation system is the amount of time that will elapse between image capture and conversion to film and the daily volume of documents being preserved. If the elapsed time is more than a day or so, and the volume is significant, it would be easier and less expensive to film first and convert to digital later. The length of time the archival resolution greyscale data has to be stored on magnetic or optical disk prior to filming, and the volume of pages to be captured, greatly affects the cost of the system. The longer this elapsed time and the higher the daily volume, the more attractive the film-first option becomes.
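
How elapsed time and daily volume drive the staging storage requirement can be illustrated with a rough estimate. The 2.25 MB per page figure is the compressed archival-quality page size cited earlier in this section; the daily volumes and holding periods are hypothetical, and the function name is illustrative.

```python
def staging_storage_gb(pages_per_day, days_before_filming, mb_per_page=2.25):
    """Disk space in GB (1,000 MB per GB) for greyscale pages awaiting filming."""
    return pages_per_day * days_before_filming * mb_per_page / 1000


# Hypothetical workloads: (pages per day, days held before filming)
for pages_per_day, days in ((1000, 1), (1000, 5), (10000, 5)):
    gb = staging_storage_gb(pages_per_day, days)
    print(f"{pages_per_day} pages/day held {days} day(s): {gb:.1f} GB")
```

The requirement grows with the product of volume and elapsed time, so a delay of a few days at high volume quickly reaches many times the capacity of a single four-gigabyte optical disc.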

Simultaneous scanning and filming: At the 1992 AIIM show several vendors, including Bell & Howell and Kodak, introduced devices that allow simultaneous scanning and filming. These devices currently have low resolution (300 dpi) and are directed at the records management market, but they have potential for the preservation market at some future date. They both employ a very gentle belt feed that could accommodate all pages that have not yet begun to turn brittle. In addition, both have the capability to scan and film both sides of the page in a single straight-through pass.

It should be noted that filming and scanning simultaneously has some of the same drawbacks as filming first. The film is created in exactly the same order that it was scanned; there is no easy way to build intelligence into the film; and if pages are skewed, misfed, or of poor quality they can only be spliced-in after the fact. Scanning-first to get the page into digital form is the most flexible and efficient processing option for the future hybrid preservation system.

Digital computer output microfilm (COM) camera: As data transmission and image enhancement technologies advance, and microprocessors become faster and more powerful, it will be cost effective to create intelligent digital film that is of higher quality than photographically produced film. It is this high-quality archival resolution digital film that could be the archival storage medium for future preservation systems. The cameras capable of producing this film are: the Electron Beam Recorder from a company called Image Graphics, Inc., in Shelton, Ct.; and a laser beam camera from iBASE Systems Corp., in Hayward, Ca. Both manufacturers claim that their camera can produce film that is comparable to photographically produced film.

This digitally created film can be intelligently indexed with blip marks and bar codes to provide automated, accurate, computer-assisted retrieval of specific pages or groups of pages from the film, a significant improvement in automated film access.

The additional intelligence that could be built into the film would allow computer-assisted monitoring programs to automatically migrate preserved documents between different levels and types of hierarchical storage consisting of magnetic disk, optical disc, digital audio tape (DAT), film, or other storage media in the most cost-effective manner. This system has the potential to eliminate one of the biggest costs associated with a large film archive: the cost of retrieving film to make copies. (Currently, at a film vault, that cost ranges from $15 to $30 per reel.) And because film is used as the system archive, any risk of obsolescence is eliminated. Optical disc would be used to provide storage for the higher-use data at levels of resolution that would satisfy the end users' information requirements (most likely 300 dpi binary).

Digital technology still under development: Some technology required to implement the hybrid preservation system, as defined herein, is still under development. High-speed, sheet-fed greyscale scanners, scanners that can scan bound books, high-speed binary and greyscale film scanners, high-capacity/high-speed reliable magnetic storage (parallel disk arrays), higher capacity write-once optical disks, high-speed greyscale digital COM cameras, and communications apparatus that can handle transmission rates of about 20 MB per second are all either unavailable or just becoming available. However, since digital imaging technology is in its infancy, these solutions will evolve rapidly. In fact, all will likely be available commercially within the next year or two.

Options for Converting from One Format to the Other

Hybrid systems must be designed to interface with past, present, and future technologies.[11] Although the design must anticipate these capabilities, operationally, the conversion can actually be accomplished through a preservation filming service bureau.

The migration path from past to present must allow preexisting microform (fiche and film) collections to be scanned and converted to a high-quality digital image format to improve access. This conversion process can take place almost automatically, depending upon the amount of intelligence built into the system and the film. It's simply a matter of mounting the right reel of film, spinning down to the correct frame and scanning the film, frame by frame. If intelligence has been built into the film during initial filming, then that intelligence can be used to index the images. The process is fast, efficient, and at a few cents a page, inexpensive. Microform scanners that support binary scanning at adequate access resolution exist. Archival resolution binary or greyscale film scanners are not yet available off the shelf, but should be in the near future. TDC, Mekel, and Photomatrix market both film and fiche scanners which provide greyscale output.

The migration path from present technologies to an older technology must allow the practitioner to create high-quality microfilm from archival resolution, greyscale digital images. This can be achieved by using a high-resolution electron beam or digital COM camera as previously mentioned. This process should be fast and efficient; but, depending on the resolution of the images, the cost of digital storage media, and the amount of time the digital data must be stored prior to creation of the film, the process may not be cheap.

The present-to-future migration path must anticipate storing not only binary and greyscale images, but also ASCII text, compound documents, audio, vector graphics*, color images, and full-motion color video. All of these formats can be represented and stored digitally. Also, in the future, it will be necessary to provide the means to store an archival copy of materials that were never in print. For any data using the page metaphor, the system remains the same. The formatted digital data is composed into pages in memory and subsequently written to film using a digital COM camera. Film is still the primary archive.

Further discussion exceeds the bounds of this paper and is due to be covered in a future paper.

ASCII Text and OCR

Extracting character code data from a page image is always an option: Technology currently available off the shelf allows pages in digital image format to be processed through an optical character reader to create ASCII text output. Certainly OCR could be helpful in automating the page creation, indexing, and abstracting process. The indexes and abstracts and/or full-text, stored in a separate database, combined with the proper automated system, could be used to gain access to the preserved information irrespective of format or storage media.

ASCII text--limited preservation usefulness: Character-coded databases are viewed as attractive because they require less storage space than image databases and are searchable. While this is indeed true, it is extremely difficult if not impossible to represent formulas, graphics, special characters, non-Roman languages, or pictorial data using just the ASCII character-coded format; therefore, this technology is not directly applicable to preservation work. However, the ASCII text data could be combined with vector graphics and raster* imaging in a compound document format in order to recreate a replica of an original page, thus solving the presentation problem. This would allow the researcher to search on the ASCII text and recreate the original page with all of its graphics and halftones--the best of both worlds. However, even if the chosen compound document format can meet all of the requirements for recreating a faithful reproduction of the original page, the storage medium is still the critical part of the preservation equation.

The ASCII text or compound document format would be especially beneficial for books or other materials where most, if not all, of the information content is text. (See Figure 7.) A typical printed page of text-only data contains about 3,000 to 4,000 characters. Using the ASCII character-coded data format, one can represent any character in the Roman alphabet in one byte of data. Therefore, a text-only page can be stored in 3 to 4 KB. A digital copy of the publisher's original font set might also be stored as a file appended to the set of full-text ASCII pages. Assuming the output printer can handle the font set and print raster images, it could be possible to reprint--on demand--a facsimile copy of a book that looks very much like the original. Adobe has recently announced a product called Carousel, which is a font- and platform-independent PostScript*.

Storing a page in a compound document format requires slightly less storage space than storing the full page image and allows text data searching. The disadvantages are that it complicates the scanning process, sacrifices some of the editorial intelligence of the document, and requires more power at retrieval time to recreate the page. Line art or halftones on the page would be represented in scanned image format and appended to the page. Given a scanning resolution of 300 dpi binary, and assuming that 50% of an 8.5 X 11 inch page* is halftone and that the image is compressed 2 to 1, the appended digital image file could be as large as 263 KB*. For comparison, a 600 dpi image satisfying the above constraints will be as large as 1.05 megabytes (4 times as large, because doubling the resolution quadruples the number of pixels).

Fortunately, the typical journal being considered for preservation contains few halftones. For this example, let's say that the average page contains about 15% halftone content. Using the same formula as above, with 300 dpi resolution, but substituting the 15% factor (.15 for .5), and again assuming 2 to 1 compression, we can calculate the halftone content of this compound page at 79 KB. Adding 3 KB for text data, we can calculate the compound document size for this particular page (ASCII and image) at about 82 KB.
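The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch that assumes a binary image uses 1 bit per pixel, that 1 KB means 1,000 bytes (which is what the figures in the text imply), and that the text portion takes 3,000 bytes; the function and constant names are hypothetical.

```python
def halftone_bytes(dpi, halftone_fraction, width_in=8.5, height_in=11.0,
                   compression_ratio=2.0):
    """Compressed size in bytes of the binary halftone portion of a page.

    Assumes 1 bit per pixel and a fixed compression ratio (2 to 1 by default).
    """
    bits = width_in * dpi * height_in * dpi * halftone_fraction
    return bits / 8 / compression_ratio


TEXT_BYTES = 3_000  # ASCII text on the page, taken as 3 KB

halftone = halftone_bytes(300, 0.15)
compound = halftone + TEXT_BYTES
print(f"halftone: {halftone / 1000:.0f} KB")   # halftone: 79 KB
print(f"compound: {compound / 1000:.0f} KB")   # compound: 82 KB

# The 600 dpi, 50% halftone comparison case
large = halftone_bytes(600, 0.50)
print(f"600 dpi, 50% halftone: {large / 1e6:.2f} MB")  # 600 dpi, 50% halftone: 1.05 MB
```

Since the pixel count scales with the square of the resolution, the 600 dpi result is exactly four times the 300 dpi result for the same halftone fraction.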

Experience has shown that the average journal-size page with 15% halftone content, scanned at 300 dpi binary, occupies about 100 KB. Since that is only about 18 KB more than the 82 KB required to store the compound document, one must weigh the tradeoffs carefully before deciding to store pages in that format.

Other data formats: Many photographs and paintings can only be represented by the original or a very high-quality image. Other graphics can be represented in image format or vector format. The intrinsic value of a document is also a significant factor in determining the appropriate format for representation. Clearly, the Declaration of Independence, the Magna Carta, or the original Gutenberg Bible cannot be replaced by ASCII-coded data, but in image format they could retain much of their intrinsic value. Of course, for the researcher needing to see how the papers contained in these documents have aged, there is no substitute for the original.[12]

Image Access, Distribution, and Transmission

Access: The system should be structured to satisfy the users' information access needs while minimizing movement of large image files. Dedicated CD-ROMs could provide access to facsimiles of very high-use preserved documents in image format. Local collections of less frequently used documents could be stored in CD-ROM jukebox servers on local area networks (LANs). Film stored in small computer-assisted retrieval (CAR) systems could provide access to the least frequently used preservation materials. It is reasonable to assume that copies of other preserved documents would be stored in a similar way at other institutions or at a central site.[13]

A user might be able to search any number of bibliographic catalogues from the desktop to identify specific materials that meet his/her research criteria. Making this database accessible over the Internet or some other network would allow widespread automated access to these treasures. The researcher could search for topics of interest or browse the image database(s) at the document structure or page level.[14]

Distribution: An average of 7,500 compressed binary journal-size page images at 300 dpi fit on a single CD-ROM. This is equivalent to 50 books or 7.5 years of a journal publication. With production costs of about $0.50 per binary page image at adequate access resolution (including indexing and abstracting), mastering costs of $1,500, and unit costs of $2.00 per disc for 100 replications,[15] one can distribute the disc to 100 locations at a manufacturing cost of about $55 per copy. In the future preservation system, even if film is the archival medium of choice, document images on CD-ROM discs could be the access and distribution vehicle.
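The per-copy figure follows from spreading the one-time production and mastering costs across the replicated discs. A minimal sketch using only the numbers quoted above (the function name is hypothetical):

```python
def cost_per_copy(pages, cost_per_page, mastering_cost, unit_cost, copies):
    """Manufacturing cost per disc when one-time costs are shared by all copies."""
    production = pages * cost_per_page   # scanning, indexing, abstracting
    replication = unit_cost * copies     # pressing the discs
    return (production + mastering_cost + replication) / copies


print(cost_per_copy(pages=7500, cost_per_page=0.50,
                    mastering_cost=1500, unit_cost=2.00, copies=100))  # 54.5
```

Production of the page images dominates the total ($3,750 of $5,450), so the per-copy cost falls quickly as the number of receiving locations grows.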

When a request is received for a less frequently accessed document stored only on intelligent film, the film could be automatically located, advanced to the proper frame, and scanned to create a digital image, and the image transmitted back to the requester. The digital copy would then be stored on optical disc. Subsequent requests for that publication could be serviced from the digital copy on optical disc. Once the document is stored on digital media, it should remain there for some period of time (defined by the institution). If the document is not accessed during that time, it is erased. Any future request for the document will be filled from the archival copy on film, and the process will repeat itself. This storage hierarchy is intelligently managed by a computer. The more frequently accessed preservation materials migrate to the faster, more expensive media, while infrequently used documents are migrated back to the slower, least expensive media.
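The request-and-expire cycle described above can be sketched as a small two-level store. This is an illustrative toy under stated assumptions, not a description of any existing system: the class and method names are hypothetical, the film archive is stood in for by a dictionary, and the retention period is expressed in arbitrary clock units.

```python
import time


class HybridStore:
    """Toy hierarchy: a permanent film archive plus an optical-disc cache."""

    def __init__(self, film_archive, retention, clock=time.time):
        self.film = film_archive    # doc_id -> images on film (never erased)
        self.disc = {}              # doc_id -> (digital copy, last access time)
        self.retention = retention  # institution-defined retention period
        self.clock = clock

    def scan_film(self, doc_id):
        # Stand-in for locating the reel, advancing to the frame, and scanning.
        return self.film[doc_id]

    def purge(self):
        """Erase disc copies that were not accessed within the retention period."""
        now = self.clock()
        stale = [d for d, (_, last) in self.disc.items()
                 if now - last > self.retention]
        for d in stale:
            del self.disc[d]

    def request(self, doc_id):
        """Return (copy, source), where source is 'disc' or 'film'."""
        self.purge()
        if doc_id in self.disc:
            copy, source = self.disc[doc_id][0], "disc"
        else:
            copy, source = self.scan_film(doc_id), "film"
        self.disc[doc_id] = (copy, self.clock())  # record this access
        return copy, source


# Usage with a fake clock so the expiry can be seen without waiting.
now = [0]
store = HybridStore({"journal-1": "page images"}, retention=10,
                    clock=lambda: now[0])
print(store.request("journal-1")[1])  # film
now[0] = 5
print(store.request("journal-1")[1])  # disc
now[0] = 100
print(store.request("journal-1")[1])  # film
```

The film copy is never removed, so erasing an idle disc copy loses nothing; it only means the next request pays the cost of a film scan again.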

Transmission: The National Research and Education Network (NREN), along with other commercial and non-commercial networks, could allow widespread access to, and ordering and delivery of, preserved materials from various archives. Fax-delivered copies of preserved documents could be ordered from other institutions or some central source. The requested documents could be retrieved and, if on film, scanned and converted to digital format, then fax-delivered back to the user within hours of the request. High-speed networking along with digital imaging promises to make the knowledge of the ages available at the desktop.

URL: http://cool.conservation-us.org/byauth/willis/hybrid/issues.html
Timestamp: Sunday, 23-Nov-2008 15:20:24 PST
Retrieved: Friday, 17-Nov-2017 17:38:11 GMT