PDF Extraction

PDF Stats

  • 10,318,073 lines in the main cdx with mimetype application/pdf and an HTTP status of 200
  • 4,544,465 unique PDFs based on content hashes

Index Ideas

  • Primary Color of the image
  • Portrait, Landscape, Square from the first page
  • Relative size of the first page (letter, legal, oversized)
  • NER data as facets
  • Creation Date - DATESTRING
  • Creation Year - INT
  • PDF type
  • Number of Pages - INT
  • Optimized - Boolean
  • PDF Size - INT
  • inlink domains
  • foundOn domains
  • language - STR
  • encrypted - Boolean
  • pdfinfo fields (see the parsing sketch after this list)
  • inlink domain number
  • foundOn domain number
  • wordCount? - INT
  • charCount? - INT
  • Percent Numerical - Float
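
Many of these fields can be read straight off the pdfinfo output captured in the processing step below. A minimal parsing sketch, assuming pdfinfo's standard "Key: value" output; the function and field names here are illustrative, not part of the original index:

#!/usr/bin/env python
# Sketch: map a pdfinfo .meta file onto some of the index fields above.
# Assumes pdfinfo's standard "Key: value" output; names are illustrative.
import sys

def parse_pdfinfo(path):
    info = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    fields = {
        "pageCount": int(info.get("Pages", "0")),
        "encrypted": info.get("Encrypted", "no").startswith("yes"),
        "optimized": info.get("Optimized", "no") == "yes",
        "creationDate": info.get("CreationDate", ""),
    }
    size_parts = info.get("File size", "0 bytes").split()
    fields["pdfSize"] = int(size_parts[0]) if size_parts else 0
    # Orientation from a value like "612 x 792 pts (letter)"
    page = info.get("Page size", "").split()
    if len(page) >= 3:
        width, height = float(page[0]), float(page[2])
        if width > height:
            fields["orientation"] = "landscape"
        elif width < height:
            fields["orientation"] = "portrait"
        else:
            fields["orientation"] = "square"
    return fields

if __name__ == "__main__":
    print(parse_pdfinfo(sys.argv[1]))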

Workflow

Extract PDF lines from master cdx file

We use the string " application/pdf 200" to extract the lines for PDFs that returned an HTTP 200 status from the eot-complete.cdx file.

We use the following code saved as mapper.py for this extraction.

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # keep only the PDF lines that returned an HTTP 200 status
    if line.lower().find(" application/pdf 200") != -1:
        print(line)

This is executed with the following command:

cat eot-complete.cdx | python mapper.py > eot-pdf.cdx

Calculate unique pdfs in dataset

Take the content hash from field 6 of each line in the cdx file, sort the hashes, and count the distinct values.

cut -d " " -f 6 eot-pdf.cdx | sort | uniq | wc -l
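
To see which PDFs are duplicated most heavily, a small variation of the same pipeline counts occurrences per hash (a diagnostic, not part of the original workflow):

cut -d " " -f 6 eot-pdf.cdx | sort | uniq -c | sort -rn | head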

Append Hash

Append the hash from each line onto the beginning of the line, then sort by that hash. The sorted output is the input for the create_wget.py script listed next.


#!/usr/bin/env python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # prepend the content hash (field 6) to the line
    line_parts = line.split(" ")
    print("%s %s" % (line_parts[5], line.strip()))


Create Wgets for each pdf hash

create_wget.py outputs only one example URL for each hash; this output is used for harvesting. Note: retries are set to 3.


#!/usr/bin/env python
import sys
# input comes from STDIN (standard input), sorted by the prepended hash
last_hash = None
for line in sys.stdin:
    line_parts = line.split(" ")
    # emit only one wget per hash; the input is sorted, so duplicates are adjacent
    if line_parts[0] == last_hash:
        continue
    last_hash = line_parts[0]
    print("wget --tries=3 -O %s/%s.pdf \"http://webarchive.library.unt.edu/eot2008/%s/%s\"" % (line_parts[0], line_parts[0], line_parts[2], line_parts[3]))

Process each pdf

Process each pdf folder with the following bash script.

Assumes xpdf (for pdftotext and pdfinfo), ImageMagick, and Stanford's NER are installed.

for i in *; do
  pdftotext "$i/$i.pdf"                  # writes $i/$i.txt
  pdfinfo "$i/$i.pdf" > "$i/$i.meta"
  convert -colorspace RGB -depth 8 -density 300 -quality 80 "$i/$i.pdf[0]" "$i/$i.jpg"
  convert -strip -scale 256x-1 -quality 80 "$i/$i.jpg" "$i/$i.thumbnail.jpg"
  bash /home/mep0037/stanford-ner-2012-07-09/ner.sh "$i/$i.txt" > "$i/$i.ner"

  echo "$i finished"
done
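
The wordCount, charCount, and Percent Numerical index ideas can be derived from the .txt files this loop produces. A minimal sketch; the function and field names are illustrative:

#!/usr/bin/env python
# Sketch: derive wordCount, charCount, and percent-numerical stats from a
# pdftotext .txt file. Names are illustrative, not part of the workflow.
import sys

def text_stats(path):
    with open(path) as f:
        text = f.read()
    chars = [c for c in text if not c.isspace()]
    digits = [c for c in chars if c.isdigit()]
    return {
        "wordCount": len(text.split()),
        "charCount": len(chars),
        "percentNumerical": (100.0 * len(digits) / len(chars)) if chars else 0.0,
    }

if __name__ == "__main__":
    print(text_stats(sys.argv[1]))

The primary-color idea could be approximated from the first-page JPEG with ImageMagick, e.g. convert $i/$i.jpg -resize '1x1!' -format "%[pixel:p{0,0}]" info: reports the average pixel color, though an average is not quite a dominant color.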


Solr URL Examples

Stats
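
A hypothetical stats query using Solr's stats component, assuming a core named eot_pdfs and field names mapped from the index ideas above:

http://localhost:8983/solr/eot_pdfs/select?q=*:*&stats=true&stats.field=pageCount&stats.field=pdfSize&rows=0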

Facets
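
A hypothetical facet query over the same assumed schema:

http://localhost:8983/solr/eot_pdfs/select?q=*:*&facet=true&facet.field=language&facet.field=creationYear&rows=0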