PDF Extraction
PDF Stats
- 10,318,073 lines in the main CDX with a MIME type of application/pdf and a status of 200
- 4,544,465 unique PDFs based on content hashes
Index Ideas
- Primary Color of the image
- Portrait, Landscape, Square from the first page
- Relative size of the first page (letter, legal, oversized)
- NER data as facets
- Creation Date - DATESTRING
- Creation Year - INT
- PDF type
- Number of Pages - INT
- Optimized - Boolean
- PDF Size - INT
- inlink domains
- foundOn domains
- language - STR
- encrypted - Boolean
- pdfinfo fields
- inlink domain number
- foundOn domain number
- wordCount? - INT
- charCount? - INT
- Percent Numerical - Float
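To make the field list above concrete, here is a minimal sketch of what a single indexed record might look like, written as a Python dict. Every field name and value is an illustrative assumption, not a finalized schema.
# Illustrative only: all field names and values below are assumptions.
record = {
    "hash": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",  # content hash from the CDX line
    "primaryColor": "white",                     # primary color of the first-page image
    "orientation": "portrait",
    "pageSize": "letter",
    "creationDate": "2008-11-04T00:00:00Z",
    "creationYear": 2008,
    "pageCount": 12,
    "optimized": True,
    "pdfSize": 482113,        # bytes
    "language": "en",
    "encrypted": False,
    "inlinkDomainCount": 3,
    "foundOnDomainCount": 1,
    "wordCount": 5210,
    "charCount": 31876,
    "percentNumerical": 4.2,  # percent of characters that are digits
}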
Workflow
Extract PDF lines from master cdx file
We are using the string " application/pdf 200" to extract the PDF entries with a 200 status from the eot-complete.cdx file.
We use the following code saved as mapper.py for this extraction.
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # keep lines whose MIME type is application/pdf with a 200 status
    if line.lower().find(" application/pdf 200") != -1:
        print(line)
This is executed with the following command
cat eot-complete.cdx | python mapper.py > eot-pdf.cdx
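An equivalent one-liner, assuming a case-insensitive match is acceptable, would be:
grep -i " application/pdf 200" eot-complete.cdx > eot-pdf.cdx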
Calculate unique pdfs in dataset
Using the hash in field 6 of each CDX line, sort the hashes and count the unique values.
cut -d " " -f 6 eot-pdf.cdx | sort | uniq | wc -l
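For reference, a hypothetical line in the common 11-field CDX format (urlkey, timestamp, original URL, MIME type, status, digest, redirect, meta, size, offset, filename) is shown below; field 6 is the content digest, which is why it is the target of the cut above. The exact layout depends on the header line of eot-complete.cdx.
net,example)/files/report.pdf 20081104120000 http://example.net/files/report.pdf application/pdf 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 482113 991234567 EOT-2008-CRAWL.arc.gz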
Append Hash
Prepend the hash from each line to the beginning of that line, then sort by the hash; the sorted output is the input for the create_wget.py script listed next.
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    line_parts = line.split(" ")
    # prepend the hash (field 6) to the stripped line
    print("%s %s" % (line_parts[5], line.strip()))
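Assuming the script above is saved as append_hash.py (the original notes do not name it), the full step might be run as:
cat eot-pdf.cdx | python append_hash.py | sort > eot-pdf-hashed.cdx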
Create Wgets for each pdf hash
The create_wget.py script outputs only one example URL for each hash; these URLs will be used for harvesting. Note: set retries to 3.
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input), sorted by hash
prev_hash = None
for line in sys.stdin:
    line_parts = line.split(" ")
    # only output one example URL for each hash
    if line_parts[0] == prev_hash:
        continue
    prev_hash = line_parts[0]
    print("wget --tries=3 -O %s/%s.pdf \"http://webarchive.library.unt.edu/eot2008/%s/%s\"" % (
        line_parts[0], line_parts[0], line_parts[2], line_parts[3]))
Process each pdf
Process each pdf folder with the following bash script.
Assumes xpdf (for pdftotext and pdfinfo), ImageMagick, and Stanford NER are installed.
for i in *; do
    # extract full text
    pdftotext $i/$i.pdf
    # extract metadata
    pdfinfo $i/$i.pdf > $i/$i.meta
    # render the first page as a JPEG
    convert -colorspace RGB -depth 8 -density 300 -quality 80 "$i/$i.pdf[0]" $i/$i.jpg
    # create a 256px-wide thumbnail
    convert -strip -scale 256x-1 -quality 80 $i/$i.jpg $i/$i.thumbnail.jpg
    # run Stanford NER over the extracted text
    bash /home/mep0037/stanford-ner-2012-07-09/ner.sh $i/$i.txt > $i/$i.ner
    echo $i finished
done
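The .meta files produced by pdfinfo can supply several of the index fields listed earlier (Number of Pages, Encrypted, Optimized, Creation Date, orientation). Below is a minimal parsing sketch, assuming standard pdfinfo "Key: value" output; the emitted field names are assumptions, not part of the original workflow.
#!/usr/bin/env python
import sys

def parse_meta(path):
    # read pdfinfo's "Key: value" lines into a dict
    info = {}
    with open(path) as f:
        for raw in f:
            if ":" not in raw:
                continue
            key, _, value = raw.partition(":")
            info[key.strip()] = value.strip()

    fields = {}
    fields["pageCount"] = int(info.get("Pages", "0"))
    # "Encrypted" can be "no" or "yes (print:yes copy:no ...)"
    fields["encrypted"] = info.get("Encrypted", "no").startswith("yes")
    fields["optimized"] = info.get("Optimized", "no") == "yes"
    fields["creationDate"] = info.get("CreationDate", "")
    # "Page size: 612 x 792 pts (letter)" gives width/height for orientation
    size = info.get("Page size", "").split()
    if len(size) >= 3:
        width, height = float(size[0]), float(size[2])
        if width > height:
            fields["orientation"] = "landscape"
        elif width < height:
            fields["orientation"] = "portrait"
        else:
            fields["orientation"] = "square"
    return fields

if __name__ == "__main__":
    print(parse_meta(sys.argv[1]))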
Solr URL Examples
Stats
- Page Number Stats
- Host Domain Count Stats
- Top Level Domain Count Stats
- Second Level Domain Count Stats
- URL Count Stats
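As a hedged illustration, the stats listed above could be pulled with Solr's stats and facet components via URLs like the following; the core name (eot-pdf) and field names (pageCount, tld) are assumptions about the schema, not confirmed by these notes.
# min/max/mean statistics on the page count field
http://localhost:8983/solr/eot-pdf/select?q=*:*&rows=0&stats=true&stats.field=pageCount
# document counts per top level domain
http://localhost:8983/solr/eot-pdf/select?q=*:*&rows=0&facet=true&facet.field=tld&facet.limit=-1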