In the Library’s Web Archives: 1,000 U.S. Government PowerPoint Slide Decks

The Digital Content Management section has been working to extract and make available sets of files from the Library’s significant Web Archives holdings. The outcome of the project is a series of web archive file datasets, each containing 1,000 files of related media types selected from .gov domains. You can read more about this series here.

PowerPoint presentations have become a nearly ubiquitous form of communication in the digital era. At the most basic level, PowerPoint files present a sequence of slides containing text, images and multimedia. Today, we are excited to share a dataset of 1,000 randomly selected slide decks from U.S. government websites, collected via the Library of Congress Web Archive. One example is the presentation on transporting hazardous materials shown in Figure 1. You can download a CSV file of data about the files, learn more about the dataset from this README, and download the entire 3.7 GB dataset of the actual files.

Understanding the 1,000 U.S. Government Slide Decks

The dataset contains 1,000 purported PowerPoint files from the United States government's .gov domain, randomly selected from the Library of Congress Web Archive. More specifically, it includes 1,000 files whose Media Type asserted an association with PowerPoint. Nearly all of these are .ppt files. Newer PowerPoint files with the .pptx extension use a different Media Type, and as a result only 11 files in the corpus end in .pptx. As part of our analysis and creation of this dataset, we ran each file through Apache Tika to collect additional metadata. For example, by aggregating the slide count and word count fields from the metadata CSV, we found that the dataset contains 22,542 individual slides and 1,340,722 individual words. The words may appear on the slides themselves or in the notes field associated with a slide. Some files did not report a slide count or a word count, and those files are not included in these aggregate numbers. The README for this dataset contains more information about these and all the other fields in the metadata CSV.
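The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used to produce the published totals: the column names `slide_count` and `word_count` and the four sample rows are assumptions made for the example, so check the README for the actual field names before running it against the real metadata CSV.

```python
import csv
import io

# Hypothetical column names; the README lists the real field names.
SLIDE_FIELD = "slide_count"
WORD_FIELD = "word_count"

# Tiny made-up sample standing in for the metadata CSV.
SAMPLE_CSV = """file_name,slide_count,word_count
a.ppt,10,500
b.ppt,,
c.ppt,30,2500
d.pptx,5,
"""


def aggregate_counts(rows, field):
    """Sum an integer field, skipping rows where it is blank or non-numeric.

    Returns a (total, rows_counted) pair so the caller can see how many
    files actually contributed to the total.
    """
    total = 0
    counted = 0
    for row in rows:
        value = (row.get(field) or "").strip()
        if not value.isdigit():
            continue  # this file did not report a usable count
        total += int(value)
        counted += 1
    return total, counted


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
slides_total, slides_counted = aggregate_counts(rows, SLIDE_FIELD)
words_total, words_counted = aggregate_counts(rows, WORD_FIELD)

print(f"{slides_total} slides across {slides_counted} files")
print(f"{words_total} words across {words_counted} files")
```

To try this on the real data, replace `io.StringIO(SAMPLE_CSV)` with an opened copy of the downloaded CSV. Returning the number of contributing rows alongside each total makes it easy to see how many files were skipped for lacking a count.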

The data suggests that, on average, these slide decks are 22 slides long and contain 1,340 words each. As the scatter plot below illustrates, a small number of outliers significantly skew the number of detected words and number of slides.

Figure 2: Scatter plot of the number of words and the number of slides in files in the dataset.
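One way to see how much outliers pull on an average is to compare the mean with the median. The sketch below uses made-up slide counts, not values from the dataset. The only figure taken from this post is the 288-slide outlier.

```python
from statistics import mean, median

# Illustrative values only: seven ordinary decks plus one 288-slide outlier.
slide_counts = [8, 12, 15, 18, 20, 22, 25, 288]


def summarize(values):
    """Return the mean, median, and maximum of a list of counts."""
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
    }


summary = summarize(slide_counts)
without_outlier = summarize(slide_counts[:-1])

print("with outlier:   ", summary)
print("without outlier:", without_outlier)
```

In this toy sample the single long deck more than doubles the mean, while the median barely moves. Reporting both numbers gives a fairer picture of a typical deck.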

The outliers in Figure 2 demonstrate the varied ways that PowerPoint is used for government publishing. For example, consider the furthest outlier in both number of slides and number of words detected: 288 and 29,939, respectively. The length of the deck and the extensive text notes included with the slides make this employee training guide from the state of Washington feel more like a book than a presentation. Similarly, this slide deck from the U.S. Department of Transportation on transporting hazardous materials contains 147 slides and 7,693 words.

U.S. Government Slide Decks Over Time

Files in this dataset were captured between 1997 and 2017. It is important to note, however, that the capture date can differ from the value in the creation date field, which was derived through Apache Tika. For example, the earliest creation date found in the dataset belongs to a 1994 slide deck on a leadership program from NASA. That file was not captured in the web archives until six years later, in 2000.

Figure 3 illustrates the gap between the original creation date of the files and their capture date. It underscores the need to understand the data, its provenance, and the nuances of its metadata. Further analysis in this area would be fascinating, and we encourage you to dive in and let us know what you find!

Figure 3: Total number of files in the dataset by year captured compared to purported year created.
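A comparison like the one in Figure 3 could be approximated by tallying files per year for each date field. The sketch below is a starting point under stated assumptions, not the method used to build the figure. The field names `capture_timestamp` and `date_created` are hypothetical, and so are the formats: it assumes both values begin with a four-digit year, as a 14-digit web archive timestamp or an ISO 8601 date would. The sample rows are invented, apart from the 1994 creation and 2000 capture years mentioned above.

```python
from collections import Counter

# Hypothetical field names; consult the README for the real ones.
CAPTURE_FIELD = "capture_timestamp"  # assumed to look like YYYYMMDDhhmmss
CREATED_FIELD = "date_created"       # assumed to look like YYYY-MM-DDThh:mm:ssZ

# Made-up rows for illustration.
rows = [
    {"capture_timestamp": "20000815123000", "date_created": "1994-03-01T09:00:00Z"},
    {"capture_timestamp": "20091102080000", "date_created": "2008-06-15T14:30:00Z"},
    {"capture_timestamp": "20091230101500", "date_created": ""},
]


def year_of(value):
    """Return the leading four-digit year as an int, or None if absent."""
    head = (value or "").strip()[:4]
    return int(head) if len(head) == 4 and head.isdigit() else None


def tally_years(records, field):
    """Count records per year for one date field, ignoring missing values."""
    years = (year_of(record.get(field)) for record in records)
    return Counter(year for year in years if year is not None)


captured = tally_years(rows, CAPTURE_FIELD)
created = tally_years(rows, CREATED_FIELD)

# Gap in years between creation and capture, for rows that have both dates.
gaps = []
for record in rows:
    captured_year = year_of(record.get(CAPTURE_FIELD))
    created_year = year_of(record.get(CREATED_FIELD))
    if captured_year is not None and created_year is not None:
        gaps.append(captured_year - created_year)

print("captured per year:", sorted(captured.items()))
print("created per year: ", sorted(created.items()))
print("gaps in years:    ", gaps)
```

Because the creation date is extracted metadata rather than an observed event, it is worth inspecting negative or very large gaps individually before drawing conclusions from them.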

What Will You Do With 1,000 U.S. Government Slide Decks?

We are curious about the ways you might explore and use this set of slide decks. Even from this initial exploration, it is clear that these varied resources have become an important part of how the government communicates and publishes.

3 Comments

  1. DrWeb
    November 22, 2019 at 2:31 pm

    Valiant, interesting effort. However, not being able to see the contents or what slides from what agency are a problem. Maybe another time through the contents would help. I’d rather not download a huge zip file, just to see what might be there, IMHO.

  2. Trevor Owens
    November 25, 2019 at 8:25 am

    Hi DrWeb, Happy to help! You can download just the CSV of metadata about the 1,000 files here. From the CSV, you can also get links to the individual files, so you can download only the ones you are interested in. The metadata has all of the original URLs for the files, so you can see which agency websites they are each from.

  3. Euan Cochrane
    November 26, 2019 at 2:26 pm

    Hi team,

    This is really wonderful and a great resource for the community. Thank you!

    Would it be possible to identify other types of presentation files? Do you have any other metadata that could help to do so other than “media type”?

    If it’s any use here is a query in Wikidata that includes many more types of formats that are created by presentation software applications:
    https://w.wiki/CpV

    I will also see about getting this added to the test corpora section on https://digipres.org

    http://www.digipres.org/#test-corpora

    Thanks again!

    Euan
