Data Work

Previously, we used Redis for storing and querying our CDX data for the EOTCD project. We are now moving over to MongoDB for various reasons, including indexing, Python driver availability, built-in map/reduce functionality, and Mark's interest in working with time series. Or maybe we just wanted to try it.

We created unique keys for records from a base-32 encoded SHA1 hash of the concatenation of the original URI, the time stamp, and the (W)ARC name.
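
As a minimal sketch (assuming the three values are concatenated directly with no separator, which is not specified above), the key generation might look like:

import base64
import hashlib

def make_record_id(origuri, timestamp, warc_name):
    # Base-32 encode the SHA1 of the concatenated fields; the 20 SHA1 bytes
    # encode to exactly 32 base-32 characters, with no '=' padding.
    digest = hashlib.sha1((origuri + timestamp + warc_name).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")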

Indexing on the original URI field does not work because of the length of those URIs. We loaded a smaller test data set (692,700+ URIs) to run some test queries of interest and to find out which fields we might need to index. Currently we have indexes on _id (the default), time stamp, mime type, and org.
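
With pymongo, setting up those secondary indexes might look something like this (connection details are assumed):

from pymongo import MongoClient

client = MongoClient()                  # assumes MongoDB on localhost
uris = client["cdxdatabase"]["uris"]

# _id is indexed by default; add secondary indexes on the other query fields.
uris.create_index("time")    # time stamp
uris.create_index("mime")    # mime type
uris.create_index("org")     # crawling organization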

To aggregate the various pieces of data we had per URI, we matched the sizes of downloaded objects (calculated by reading the (W)ARCs again) with the data from their CDX file records (our original CDX files did not store item size). Because URIs can occur more than once in an archive collection (crawled multiple times by multiple institutions), we also looked at the time stamp, the (W)ARC the URI instance came from, and the checksum; any combination of fields that uniquely matched the two would work (see the sketch below). While bringing the sizes and CDX data together, we also calculated other information, including the SURT form of the URI, the domain SURT form of the URI, the organization the URI should be attributed to, and the top-level domain. We didn't consider too thoroughly whether it is better to store this information or to derive it from other stored fields when it is needed. We decided it was best to store more information now, when it was easy to put it all in together, and to get rid of it later if it becomes problematic.
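
A rough sketch of that matching step (the helper names and the exact field combination here are illustrative assumptions, not the code we actually ran):

def match_key(rec):
    # Fields that together uniquely identify a capture; original URI plus
    # time stamp plus (W)ARC name is one workable combination (the checksum
    # could be used as well, or instead).
    return (rec["origuri"], rec["time"], rec["warc"])

def merge_sizes(cdx_records, sizes_from_warcs):
    # sizes_from_warcs: dict mapping match_key -> size in bytes, built by
    # re-reading the (W)ARCs; cdx_records: dicts parsed from the CDX lines.
    for rec in cdx_records:
        rec["size"] = str(sizes_from_warcs.get(match_key(rec), 0))
        yield rec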

A document in the "uris" collection of the "cdxdatabase" in MongoDB looks like:

{
    u'redirect': u'http://webarchives.loc.gov/collections/lcwa0007/20001003073353/http://www.texasgop.org/images/mast_ele.gif',
    u'sha1': u'3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
    u'canuri': u'webarchives.loc.gov/collections/lcwa0007/20000922033433/http:/www.texasgop.org/images/mast_ele.gif',
    u'time': u'20090116132030',
    u'domsurt': u'http://(gov,loc,',
    u'httpstat': u'302',
    u'warc': u'LOC-ENDOFTERM-01-20090116131839-11548-crawling108.us.archive.org.arc.gz',
    u'surt': u'http://(gov,loc,webarchives,)/collections/lcwa0007/20000922033433/http:/www.texasgop.org/images/mast_ele.gif',
    u'mime': u'image/gif',
    u'origuri': u'http://webarchives.loc.gov/collections/lcwa0007/20000922033433/http://www.texasgop.org/images/mast_ele.gif',
    u'offset': u'49049023',
    u'org': u'LOC',
    u'_id': u'OI4XIO4ZOCWIDXPLM62RDGLWAYVTHF4A',
    u'topdom': u'gov',
    u'size': u'0'
}


After 160,000,000+ URI records/documents were put into our database, we created another collection for time series data. A document in the "daily" collection of the "cdxdatabase" in MongoDB looks like:

{
    u'toturis': 2612986,
    u'UNT': {
        u'count': 82133,
        u'size': 9940201206L,
        u'okcount': 80370,
        u'oksize': 9939774642L
    },
    u'CDL': {
        u'count': 2520047,
        u'size': 347930920635L,
        u'okcount': 2266322,
        u'oksize': 346843900209L
    },
    u'totsize': 358366285264L,
    u'totoksize': 357278837879L,
    u'IA': {
        u'count': 10806,
        u'size': 495163423,
        u'okcount': 10801,
        u'oksize': 495163028
    },
    u'totokcount': 2357493,
    u'_id': u'20090119'
}

This structure gives us, per day and per institution, the total URIs downloaded, the total bytes downloaded, and the same totals restricted to items with an HTTP status of 2XX (OK).
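
Pulling a figure back out is then a simple lookup; for example, the gigabytes CDL downloaded on a given day (a hypothetical query, assuming the same connection as above):

from pymongo import MongoClient

daily = MongoClient()["cdxdatabase"]["daily"]

# Bytes downloaded by CDL on 2009-01-19, converted to gigabytes.
doc = daily.find_one({"_id": "20090119"})
if doc is not None:
    print(doc["CDL"]["size"] / float(1024 ** 3))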


Deduplication (Duplication Reduction)

Deduplicated records, which ARE represented in our MongoDB "uris" collection, are NOT reflected in the time series collections in MongoDB. This is because deduplicated records don't show up with a WARC name in the CDX, so the WARC name is not in the MongoDB document (record) either, and it is from the WARC name that we determine the crawling organization for the time series data. If the "uris" document (record) had no org value, we didn't count it for the time series (or the time series visualizations). Deduplicated records are also NOT reflected in the treemap visualization below, as we did not count anything without a size value greater than 0, and the deduplicated records are not directly connected to a size value.
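
In other words, the filtering amounts to something like this (a sketch of the rules just described, not the actual script):

def counts_for_timeseries(doc):
    # Deduplicated records have no WARC name and therefore no org;
    # anything without an org value is skipped for the time series.
    return doc.get("org") not in (None, "-")

def counts_for_treemap(doc):
    # Sizes are stored as strings ('0', 'None', '9940201206', ...);
    # only records with a real, positive size are counted in the treemap.
    try:
        return int(doc.get("size", "0")) > 0
    except (TypeError, ValueError):
        return False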

In searching the WARC files, we found 3,860,113 revisit records (the type of record written to WARCs for deduplicated items) across 17,886 WARC files.

A deduplication record in the "uris" collection of the "cdxdatabase" in MongoDB looks like:

{
    u'redirect': u'-',
    u'sha1': u'3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
    u'canuri': u'lhncbc.nlm.nih.gov/apdb/multim/blbro01.gif',
    u'time': u'20090507080651',
    u'domsurt': u'http://(gov,nih,',
    u'httpstat': u'-',
    u'warc': u'-',
    u'surt': u'http://(gov,nih,nlm,lhncbc,)/apdb/multim/blbro01.gif',
    u'mime': u'-',
    u'origuri': u'http://www.lhncbc.nlm.nih.gov/apdb/multim/blbro01.gif',
    u'offset': u'0',
    u'org': u'-',
    u'_id': u'5R76Q3PN7R6Q4BHFECD2UP7LR7TQHCAZ',
    u'topdom': u'gov',
    u'size': u'None'
}

One thing to note: since these MongoDB documents (records) were generated from the CDX files, they don't have the WARC-Record-IDs. This makes it impossible, in this form, to affiliate the deduplicated records exactly with the captures they duplicate (we can find past captures of the URI with the same SHA1, but if there are multiple matches, we cannot be exact). However, it wouldn't be all that difficult (though perhaps a little costly) to add the WARC-Record-IDs to the database.
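
Finding the candidate originals for a revisit record is straightforward, though; a hypothetical helper (field names taken from the documents above):

from pymongo import MongoClient

uris = MongoClient()["cdxdatabase"]["uris"]

def candidate_originals(dedup_doc):
    # Earlier captures of the same canonical URI with the same payload
    # digest; if more than one matches, we cannot tell which capture the
    # revisit record actually points to without the WARC-Record-ID.
    return list(uris.find({
        "canuri": dedup_doc["canuri"],
        "sha1": dedup_doc["sha1"],
        "warc": {"$ne": "-"},          # skip other revisit records
        "time": {"$lt": dedup_doc["time"]},
    }))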


Visualizations

From this data we are making time series and other visualizations.

Using dygraphs, we created a visualization showing the number of gigabytes downloaded per day per institution: http://research.library.unt.edu/eotcd/visualization/timeseries/eot_timeseries_daily.html.

Here is the number of URIs (in thousands) downloaded per day per institution: http://research.library.unt.edu/eotcd/visualization/timeseries/eot_timeseries_daily_uri.html.

Here is the number of URIs with a 2XX (OK) HTTP status (in thousands) downloaded per day per institution: http://research.library.unt.edu/eotcd/visualization/timeseries/eot_timeseries_daily_ok.html.

Here are the three above on one page for comparison's sake: http://research.library.unt.edu/eotcd/visualization/timeseries/eot_timeseries_daily_compare.html.

Here is a treemap showing the EOT metrics of size and count for our Archive Content Categories, as given at Web_Archive_Service_Models_and_Metrics: http://research.library.unt.edu/eotcd/visualization/treemaps/eot_metrics_treemap.html.