Finding By the People Transcriptions in the Library’s Digital Collections

Today’s guest post is from Dr. Victoria Van Hyning, who served as a By the People Community Manager at the Library from 2018-2020. Starting in Fall 2020, she will be an Assistant Professor of Library Innovation at the University of Maryland iSchool, where she will continue her research on crowdsourcing, outreach, and inclusion.

The Library of Congress launched the By the People (crowd.loc.gov) crowdsourcing project in October 2018. The project invites anyone with an internet connection to transcribe, review, and tag digitized images of manuscripts and typed materials from the Library’s collections. Everyone is welcome to take part! Volunteers don’t even need to create an account, but those who do have access to additional features such as tagging and reviewing other people’s transcriptions.

All transcriptions are created and reviewed by volunteers before they are made available on loc.gov, the Library’s main website and discovery layer. These transcriptions improve search, readability, and access to handwritten and typed documents for those who cannot read the handwriting of the original documents, or who use screen readers.

The By the People team works with a range of technical and curatorial staff across the Library to bring image files and metadata from loc.gov to crowd.loc.gov (where the materials are transcribed and reviewed) and to bring the resulting transcriptions back to loc.gov (where they improve search and access).

By following this link, you’ll get to the list of all By the People transcriptions that have been published on loc.gov. The search results for this link are updated automatically over time as new content is added. As of July 2020, we’ve published 16,315 transcriptions on the Library’s main website for the following collections:

Branch Rickey baseball scouting reports
Mary Church Terrell
Abraham Lincoln
Clara Barton
Susan B. Anthony
William Oland Bourne and the disabled Civil War soldiers

Publish early, acknowledge often

Image: Abraham Lincoln papers, Series 4. Addenda, 1774-1948: Cigar label, “El Biejo Onesto Abe Cigarros”, 1860.

By the People is designed as a stand-alone website, meaning it is not directly tied to the Library’s main website. It was created here at the Library in 2018, when the LC Labs team joined with Library Services and the Platform Services Division of the Office of the Chief Information Officer to develop the crowdsourcing initiative and its software platform, Concordia. By the People built on an earlier experiment launched by LC Labs in 2017, called Beyond Words, as well as other crowdsourcing investigations. By the People has now moved from experiment to flourishing program, hosted by the Library’s Digital Content Management Section. The underlying code from Concordia is freely available for use and reuse via the Library of Congress’s GitHub.

Changes made to images, metadata, or transcriptions on one site are not automatically reflected on the other. This means we need to reintegrate the transcriptions into the Library’s digital collections access systems in bulk. We were keen to demonstrate our ability to bring transcriptions back from By the People to the Library’s main website, so just three months after the project launched we published our first batch of crowdsourced transcriptions: 781 pages from the Abraham Lincoln papers.

We were also committed to prominently acknowledging the work of volunteers on this project. An attribution is included in each searchable, downloadable .txt file, and an overlay appears over the transcription viewer on loc.gov stating “Transcribed and reviewed by volunteers participating in the By the People project at crowd.loc.gov.”

Testing, testing

These 781 Lincoln pages were a pilot for our process and taught us many valuable things. The first was that it was best to bring back only completed items (i.e., a whole diary or letter, rather than a hodgepodge of completed pages from within an item that is still being transcribed and reviewed). Although an argument could be made that making more text searchable more quickly would optimize research and access, it turned out that researchers and our volunteers were confused when they saw transcriptions for only a handful of pages within a larger object. Therefore, we adjusted our process in the Concordia application to export only completed items.
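To make the "completed items only" rule concrete, here is a minimal sketch in Python. It is not Concordia's actual export code: the record layout, the field names (`item_id`, `page`, `status`), and the status labels are all assumptions made up for illustration. The idea is simply to group pages by item and export an item only when every one of its pages is marked completed.

```python
from collections import defaultdict

# Hypothetical status label; the real platform's status names may differ.
COMPLETED = "completed"


def completed_items(pages):
    """Return the ids of items whose every page is completed, sorted."""
    statuses = defaultdict(list)
    for page in pages:
        statuses[page["item_id"]].append(page["status"])
    return sorted(
        item_id
        for item_id, page_statuses in statuses.items()
        if all(status == COMPLETED for status in page_statuses)
    )


def export_rows(pages):
    """Keep only the pages that belong to fully completed items."""
    ready = set(completed_items(pages))
    return [page for page in pages if page["item_id"] in ready]


# Illustrative records only; these are not real Library data.
pages = [
    {"item_id": "letter-1", "page": 1, "status": "completed"},
    {"item_id": "letter-1", "page": 2, "status": "completed"},
    {"item_id": "diary-7", "page": 1, "status": "completed"},
    {"item_id": "diary-7", "page": 2, "status": "in_review"},
]

print(completed_items(pages))   # ['letter-1']
print(len(export_rows(pages)))  # 2
```

In this sketch the diary is held back even though one of its pages is finished, which mirrors the lesson from the pilot: a partially transcribed item is not exported until the whole item is done.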

Other iterations involved changing how we display transcriptions from other sources in the Lincoln papers. A new overlay was applied to some 10,000 pages in the Lincoln papers whose transcriptions were created years ago by members of Knox College, thus bringing greater prominence to that collaboration. In early discussions about how to include and phrase attributions, senior staff reflected that this overlay might eventually enable the Library to indicate the provenance of other kinds of text, including Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR).

Whole datasets

In March 2019, we released our second data ingest, featuring the entirety of the Branch Rickey baseball scouting reports, which volunteers had transcribed and reviewed in just four months. Five weeks after the Campaign was completed, these 1,926 pages were published on loc.gov to celebrate Opening Day, while the bulk data was made available in .csv form, along with a README file and some preliminary analysis, on labs.loc.gov.

Later this year, this version of the data, along with an updated version (v2), will be released as a government report with its own catalog record, titled “Datasets from Branch Rickey scouting reports.” Additional datasets will be added to the catalog as campaigns are completed. Examples of completed Campaigns include left-handed penmanship entries to a competition run by preacher and publisher William Oland Bourne for disabled Union veterans of the Civil War; the letters, autobiographical fragments, protest coordination notebook, and other documents from Rosa Parks’s papers; and the diaries, speeches, and letters of suffragists including Susan B. Anthony and Carrie Chapman Catt.

These whole campaign datasets might interest researchers of machine learning, linguistics, history, sociology, and other domains.
