Request for Information - Web Harvesting Service at the Library of Congress

10 July 2020

Washington, D.C.

This is a Request for Information (RFI) to obtain capability statements and feedback for an upcoming Library of Congress (LOC) requirement. The solicitation is not expected until the first or second quarter of fiscal year 2021.

Specifically, the Library of Congress (Library) requires contract support to enable the systematic harvesting of content from the web based on instructions from Library staff, to provide temporary access to the content and required crawl reports for quality review, and to enable transfer of content to the Library for preservation and public access. Since 2000, the Library has collected and preserved harvested web content related to a variety of thematic and event-based topics, such as the United States National Elections, Public Policy Topics, Congressional and Legislative Branch topics, and Web Comics. The web archives are an important component of the Library’s born-digital collections. The harvesting of selected websites for the Library’s collections supports the Library’s strategic goal to acquire, preserve, and provide access to a universal collection of knowledge and the record of America’s creativity. Web harvesting services will result in the capture of web content to be added to the Library of Congress digital collections.

The scale and variability of the collecting required are notable. At any given time, the Library typically requires thousands of seeds crawled at varying frequencies (typically twice-daily for RSS feed content, and weekly, monthly, quarterly, twice-yearly, and yearly for other types of content). Content targeted for archiving can be single documents (such as PDFs), entire websites (such as State.gov), specific domains or portions of websites focused on a particular topic, RSS feeds, or a variety of social media content (such as public Facebook pages for organizations or people, specific Twitter accounts, or entire YouTube channels). Content targeted for archiving is published in the United States and in a multitude of other countries, in a variety of languages.
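
As a rough illustration of the crawl frequencies described above, the following Python sketch models a seed list and selects the seeds that are due for harvesting. It is not drawn from the RFI: the `Seed` class, the `is_due` and `due_seeds` helpers, the example URLs, and the day counts used to approximate "monthly", "quarterly", and so on are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Approximate intervals for the frequencies named in the notice.
# The exact durations are assumptions made for this sketch.
CRAWL_INTERVALS = {
    "twice-daily": timedelta(hours=12),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=91),
    "twice-yearly": timedelta(days=182),
    "yearly": timedelta(days=365),
}


@dataclass
class Seed:
    """A hypothetical seed record: a URL plus its requested crawl frequency."""
    url: str
    frequency: str
    last_crawled: Optional[datetime] = None


def is_due(seed: Seed, now: datetime) -> bool:
    """Return True if the seed has never been crawled or its interval has elapsed."""
    if seed.last_crawled is None:
        return True
    return now - seed.last_crawled >= CRAWL_INTERVALS[seed.frequency]


def due_seeds(seeds: List[Seed], now: datetime) -> List[Seed]:
    """Select the seeds that should be crawled at time `now`."""
    return [seed for seed in seeds if is_due(seed, now)]


if __name__ == "__main__":
    now = datetime(2020, 7, 10, 12, 0)
    seeds = [
        Seed("https://example.org/feed.rss", "twice-daily", datetime(2020, 7, 10, 0, 0)),
        Seed("https://example.org/", "weekly", datetime(2020, 7, 5, 0, 0)),
        Seed("https://example.org/report.pdf", "yearly"),
    ]
    for seed in due_seeds(seeds, now):
        print(seed.url)
```

A real harvesting service would also need to carry the scoping instructions supplied with each seed list; this sketch covers scheduling only.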

The Library’s web archives follow a permissions-based model, so content is carefully curated. With each seed list, the Library provides scoping instructions to the Contractor to ensure that content is captured in accordance with the Library’s permissions approach.
