Digital preservation: What needs to be done?
Most recent news content resides in the back rooms and basements of news agencies across the country. It’s scattered through various forms of media, in all kinds of formats, and often with little organization, management or care. To make this content accessible — not just today, but for years to come — a number of tasks need to be accomplished.
A critical element might be to connect with those who have the resources and capability to help. Why? Because digital preservation is not a simple, one-time-only effort. Preparation of content for preservation can be complex. Educopia has recently published Guidelines for Digital Preservation Readiness, which provides a helpful overview of the tasks for preparation. Once ingested into a viable, sustainable system, the content requires constant maintenance, monitoring and management. As access is the whole point of digital preservation, rights agreements need to be documented and observed, Web-ready derivatives generated and replaced over time, and access provided in compliance with rights restrictions. Clearly, this is a lot of work. Where does one begin?
Six steps to preservation and access
1. Take Inventory
When faced with born-digital content that needs to be preserved, the first step is to take stock: What is located where, in what condition, on what media, and in what formats? Categorizing the types of media is a good place to start. Have video content, but don’t know what kind it is, or how endangered? The National Center for Preservation Technology and Training maintains a list of image and video formats from 1956 to 1995, ranked by danger of obsolescence.
Minimally, a good inventory will include date stamps and versions as well as location, size, format and identification. Once the different types of media are identified and grouped, it might be necessary to search for specialized hardware and software that will allow access. Ideally, those performing the inventory will use an established inventory workstation equipped with write-blockers to prevent modification of the originals, and they will assign unique identifiers and collect checksums for each file. A checksum is a string of characters computed cryptographically from a file’s contents, effectively unique to that file. This “fixity check” can be used to confirm that a file has not been modified and thus verify that what has been preserved is actually the original file. Because digital content is so easy to modify, this becomes critical in digital preservation. Otherwise, history could be altered!
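For those comfortable with a bit of scripting, the sketch below shows one way such a fixity check might work in practice, using Python’s standard hashlib library. The folder name and manifest structure are purely illustrative.

```python
# Minimal fixity-check sketch: record SHA-256 checksums at inventory time
# and re-verify them later. Folder name and manifest layout are illustrative.
import hashlib
from pathlib import Path

def checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a checksum for a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(directory):
    """Record a checksum for every file under a directory."""
    return {str(p): checksum(p) for p in Path(directory).rglob("*") if p.is_file()}

def verify_manifest(manifest):
    """Re-compute checksums and report any file whose fixity has changed."""
    return [path for path, expected in manifest.items() if checksum(path) != expected]

if __name__ == "__main__":
    manifest = build_manifest("inventory")          # hypothetical inventory folder
    changed = verify_manifest(manifest)
    print("Files failing the fixity check:", changed or "none")
```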
2. Collect Metadata
As the content is accessed, this is an excellent time to gather more information about it to assist in management, organization of files and continued accessibility. For example, there might be multiple versions of a particular issue, or videos related to textual content, where relationships among files need to be clarified and retained. This information is called “metadata,” and there are several different types: descriptive, structural, technical, administrative and rights. An important focus at this point for news content is copyright restrictions: Are they documented? Are there additional rights restrictions, such as content submitted by readers or permissions from those photographed or interviewed? What is the context necessary to understand this material? External documentation of the information necessary to manage, access and use the content is essential to digital preservation. Each item should be assigned a unique identifier, and a method of linking its metadata to the object itself is crucial.
The current international standard for preservation metadata is PREMIS, which encompasses multiple types of metadata and specifies minimum and desirable levels of information to be captured and stored for digital preservation. For example, has the preservationist been granted the right to make copies, generate derivatives, extract and assign metadata, and migrate to newer formats over time? These steps are necessary for preservation, and all such interactions must be documented throughout the life of the object. If the object is modified, such as during migration to a newer format for continued access, it becomes a new object with metadata of its own.
Different formats require different technical metadata, for which PREMIS is simply a starting point. For example, technical metadata for image files would include the color profile, the sampling rate and the software used to create them. It’s helpful at this point to consult with preservationists in cultural heritage organizations who are already familiar with preserving the type of content in question.
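To make this more concrete, here is a rough sketch of what a combined metadata record for a single preserved file might look like. It is written in Python purely for illustration; it is not a PREMIS-conformant record, and every field name and value shown is an example rather than a requirement.

```python
# Illustrative metadata record for a single preserved file.
# This is NOT a PREMIS-conformant record; field names and values are
# examples only, and PREMIS itself defines a much richer schema.
import json
import uuid

record = {
    "identifier": str(uuid.uuid4()),           # unique identifier assigned at inventory
    "descriptive": {
        "title": "Front page, 12 March 1998",  # example item
        "related_objects": ["photo-0042"],     # link versions and related files
    },
    "technical": {
        "format": "image/tiff",
        "color_profile": "Adobe RGB (1998)",
        "creating_software": "example scanning software",
    },
    "administrative": {
        "fixity": {"algorithm": "SHA-256", "value": "<checksum goes here>"},
        "events": [{"type": "ingestion", "date": "2015-06-01"}],
    },
    "rights": {
        "copyright_holder": "Example Tribune",
        "restrictions": "Reader-submitted photo; web display permitted with credit",
    },
}

print(json.dumps(record, indent=2))
```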
3. Normalize formats
If the file format or media are no longer current, now is an excellent time both to capture the original and to migrate it to archival formats that ensure current and future access. Uncompressed, widely supported and preferably non-proprietary file formats are preferred for archival purposes. The Library of Congress maintains a useful list of digital formats and their levels of sustainability. Selecting the most viable formats and migrating content to those few formats is called “normalization,” a process that reduces management overhead by limiting the number of file types that must be monitored and migrated over time. Documents containing multiple files will typically be normalized into archival formats for each type of file included. For example, a PDF might include transcription text for search and retrieval, images and possibly embedded software. Best practices recommend that the original document be retained unchanged as well.
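As a rough illustration of a normalization pass, the Python sketch below keeps the originals untouched and creates uncompressed TIFF masters from assorted image formats. It assumes the ImageMagick command-line tool is installed (the command is “magick” in version 7, “convert” in older versions), and all folder names are hypothetical.

```python
# Sketch of a normalization pass: retain originals unchanged and create
# uncompressed TIFF masters from assorted image formats.
# Assumes the ImageMagick command-line tool is installed ("magick" in
# version 7, "convert" in older versions); all paths are hypothetical.
import shutil
import subprocess
from pathlib import Path

SOURCE = Path("inventory/images")        # hypothetical source folder
ORIGINALS = Path("archive/originals")    # originals retained as-is
MASTERS = Path("archive/masters")        # normalized archival copies

ORIGINALS.mkdir(parents=True, exist_ok=True)
MASTERS.mkdir(parents=True, exist_ok=True)

for src in SOURCE.iterdir():
    if src.suffix.lower() not in {".jpg", ".jpeg", ".png", ".bmp"}:
        continue
    shutil.copy2(src, ORIGINALS / src.name)          # keep the original unchanged
    master = MASTERS / (src.stem + ".tif")
    subprocess.run(
        ["magick", str(src), "-compress", "None", str(master)],
        check=True,
    )
```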
A partnership in Illinois called Preserving (Digital) Objects With Restricted Resources (POWRR) has developed a helpful tool grid to assist in the selection of software for various preservation functions and has just released a white paper comparing several currently available systems. Below is a diagram of the categories in which they test various functionalities.
The Library of Congress also maintains a list of tools used within the digital preservation community.
4. Organize Content
Once formats are normalized and as metadata is collected, organization of the content becomes key. File and folder naming conventions should be examined, normalized and made to meet the restrictions of all current operating systems, as storage media will likely change over time. A common method of packaging content and metadata together in the preservation community is the Metadata Encoding and Transmission Standard (METS). Those with minimal access to technical resources might find BagIt file packaging software simpler and sufficiently effective.
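For those who go the BagIt route, the Library of Congress publishes a small Python package, bagit-python (installable with “pip install bagit”), that creates and validates bags. The sketch below shows roughly how it might be used; the directory and bag-info fields are illustrative.

```python
# Minimal BagIt packaging sketch using the Library of Congress's
# bagit-python package (pip install bagit). The directory and bag-info
# fields shown here are illustrative.
import bagit

# Turn a directory of content and metadata into a bag: files are moved
# into a data/ subfolder and checksum manifests are written alongside.
bagit.make_bag(
    "archive/masters",
    {"Contact-Name": "News Archive Team", "External-Description": "1998 front pages"},
)

# Later, or after transferring the bag to another storage location,
# re-open it and validate the manifests against the files on disk.
bag = bagit.Bag("archive/masters")
print("Bag is valid:", bag.is_valid())
```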
5. Secure Storage
Effective, safe and managed storage of content is critical. Backups are helpful, but to ensure good backups aren’t overwritten by bad ones, checksums should be verified regularly. And backups alone are not enough. Current best practices require multiple copies to be stored in geographically distant locations to guard against regional natural disasters. Partnerships with other organizations utilizing LOCKSS (Lots Of Copies Keep Stuff Safe) networks can ensure the content remains safe at the storage level at minimal cost. LOCKSS uses a peer-to-peer backup system in which each storage box synchronizes its content against that held by its partners, and ideally all partners are in geographically distant locations.
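LOCKSS networks automate this kind of comparison among peers. As a much simpler illustration of the underlying idea, the Python sketch below re-checks two storage copies against a stored checksum manifest and repairs any damaged file from a known-good copy; the mount points and manifest format are hypothetical, and this is in no way a substitute for a real LOCKSS network.

```python
# Simplified illustration of regular fixity checks across storage copies.
# This is not LOCKSS; it only shows the underlying idea of comparing
# copies against known checksums and repairing from a good copy.
# Mount points and the manifest format are hypothetical.
import hashlib
import json
import shutil
from pathlib import Path

COPIES = [Path("/mnt/storage_a"), Path("/mnt/storage_b")]   # hypothetical mounts

def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# manifest.json maps each relative file path to its expected checksum
manifest = json.loads(Path("manifest.json").read_text())

for rel_path, expected in manifest.items():
    good = [c / rel_path for c in COPIES
            if (c / rel_path).exists() and sha256(c / rel_path) == expected]
    bad = [c / rel_path for c in COPIES if (c / rel_path) not in good]
    for target in bad:
        if good:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(good[0], target)        # repair from a known-good copy
            print(f"Repaired {target} from {good[0]}")
        else:
            print(f"No good copy available for {rel_path}")
```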
Cloud storage options might be viable solutions if the content is kept encrypted and in a sufficient number of locations, and if the fine print in the contract does not undermine the intent of long-term storage and continued access, even in the face of institutional failure.
Business continuity must be considered. Turnover in staffing can disrupt workflows and place content at risk. Loss of funding, mergers, takeovers and bankruptcy can effectively wipe out all future access to stored content if it is not held in a partnership arrangement. History should not be at risk in these difficult times. True digital preservation is a community effort and requires collaboration, dedication and commitment.
6. Provide Access
The purpose of digital preservation is long-term access. Journalists, historians, scholars and other users need to be able to search and find the content they need and access it with current software and hardware. To make this possible, Web-ready derivatives need to be created from the preservation masters and descriptive metadata extracted for use in search and retrieval. Access systems must comply with copyright and other rights restrictions, and these restrictions should be reviewed regularly, as they might change over time. Content will need to be migrated among access systems over time, and new derivatives generated as software and hardware continue to change.
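As one small example of derivative generation, the Python sketch below produces web-ready JPEGs from TIFF preservation masters using the Pillow imaging library (installable with “pip install Pillow”); the folder names and sizing choices are illustrative.

```python
# Sketch of generating web-ready JPEG derivatives from TIFF preservation
# masters, using the Pillow imaging library (pip install Pillow).
# Folder names and sizing choices are illustrative.
from pathlib import Path
from PIL import Image

MASTERS = Path("archive/masters")
DERIVATIVES = Path("web/derivatives")
DERIVATIVES.mkdir(parents=True, exist_ok=True)

for master in MASTERS.glob("*.tif"):
    with Image.open(master) as img:
        img = img.convert("RGB")           # convert to RGB for web display
        img.thumbnail((2000, 2000))        # cap the longest side for delivery
        out = DERIVATIVES / (master.stem + ".jpg")
        img.save(out, "JPEG", quality=85)
```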
Digital preservation requires planning and continual management, and it cannot succeed in a vacuum. Much helpful information and many free or low-cost tools are available on the Internet. Expertise is growing as more people become aware of the critical impact of losing valuable digital content. Schools have begun offering training in digital preservation, particularly to librarians and archivists. Effective systems and collaborations are already in motion, and more continue to develop. News publishers need to be involved in these collaborations if the historic record of our times is to survive.