Available Data
Available Data About EOTCD Content
CDX Files
Sample line: mms.gov/awards/assets/photos/verret.jpg 20100526135234 http://www.mms.gov/awards/Assets/photos/verret.jpg image/jpeg 200 3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX - 930684 UNT-20100526134208-00298-libharvest1.warc.gz
The fields are as follows:
- canonicalized URL: mms.gov/awards/assets/photos/verret.jpg
- date: 20100526135234
- original URL: http://www.mms.gov/awards/Assets/photos/verret.jpg
- mimetype: image/jpeg
- HTTP status code: 200
- checksum: 3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX
- redirect: - (the example is not a 3XX redirect response, so the field holds only a dash as a placeholder)
- (w)arc file offset: 930684
- name of file containing record: UNT-20100526134208-00298-libharvest1.warc.gz
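
The field layout above can be turned into a small parser. The following is a minimal sketch, assuming every line has exactly the nine whitespace-separated fields shown in the sample and that a dash marks an empty field; the names CdxRecord and parse_cdx_line are illustrative and not part of any EOTCD tooling.

```python
from typing import NamedTuple, Optional

# The sample line from this page, used as illustrative input.
SAMPLE = (
    "mms.gov/awards/assets/photos/verret.jpg 20100526135234 "
    "http://www.mms.gov/awards/Assets/photos/verret.jpg image/jpeg 200 "
    "3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX - 930684 "
    "UNT-20100526134208-00298-libharvest1.warc.gz"
)


class CdxRecord(NamedTuple):
    """Hypothetical container for one CDX line (field names are assumptions)."""
    canonicalized_url: str
    date: str
    original_url: str
    mimetype: str
    status_code: str          # kept as text, in case a line carries a non-numeric value
    checksum: str
    redirect: Optional[str]   # None when the field is the "-" placeholder
    offset: int
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one CDX line into its nine fields, in the order listed above."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    canon, date, original, mimetype, status, checksum, redirect, offset, name = parts
    return CdxRecord(
        canonicalized_url=canon,
        date=date,
        original_url=original,
        mimetype=mimetype,
        status_code=status,
        checksum=checksum,
        redirect=None if redirect == "-" else redirect,
        offset=int(offset),
        filename=name,
    )
```

Calling parse_cdx_line(SAMPLE) gives a record whose mimetype is image/jpeg and whose redirect is None. Lines with a different number of fields raise ValueError rather than being silently misread.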
Data Across The Collection
- CDX files processed: 125713
- total URIs processed: 160211356
- invalid URIs processed: 785
- total characters in URIs: 16411632088
- unique subdomains processed: 141076
- unique domains processed: 226
- other numbers:
  - domain distribution across collection
  - mimetype distribution across collection
  - HTTP status code distribution across collection
  - number of links between two second level subdomains
  - number of links between two full level subdomains
  - number of URLs per subdomain
  - mimetype distribution per subdomain
  - HTTP status code distribution per subdomain
  - sizes in bytes per mimetype at subdomain level
  - bytes per relevant gov and mil second level subdomains
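
Several of the distributions listed above can be tallied directly from CDX lines. The following is a rough sketch, not the process used to produce the numbers on this page. It assumes the nine-field layout of the sample line and treats the full hostname of the original URL as the "subdomain". The function name summarize_cdx is hypothetical. Link counts and byte sizes are left out because the sample line carries no field for them.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def summarize_cdx(lines):
    """Tally mimetype, status code, and per-host counts from an iterable of CDX lines."""
    mimetypes = Counter()                      # mimetype distribution across all lines
    statuses = Counter()                       # HTTP status code distribution
    urls_per_host = Counter()                  # number of URLs per hostname
    mimetypes_per_host = defaultdict(Counter)  # mimetype distribution per hostname

    for line in lines:
        parts = line.split()
        if len(parts) != 9:
            continue  # skip anything that does not match the assumed layout
        original_url, mimetype, status = parts[2], parts[3], parts[4]
        try:
            host = (urlsplit(original_url).hostname or "").lower()
        except ValueError:
            continue  # urlsplit rejects some malformed URLs
        if not host:
            continue
        mimetypes[mimetype] += 1
        statuses[status] += 1
        urls_per_host[host] += 1
        mimetypes_per_host[host][mimetype] += 1

    return mimetypes, statuses, urls_per_host, mimetypes_per_host
```

Because the function accepts any iterable of strings, an open CDX file can be passed to it directly and read one line at a time.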