Available Data

Available Data About EOTCD Content

CDX Files

 Sample line:
   mms.gov/awards/assets/photos/verret.jpg 20100526135234 http://www.mms.gov/awards/Assets/photos/verret.jpg image/jpeg 200 3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX - 930684 UNT-20100526134208-00298-libharvest1.warc.gz

The fields are as follows:

  • canonicalized URL: mms.gov/awards/assets/photos/verret.jpg
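  • date: 20100526135234 (capture timestamp, YYYYMMDDhhmmss)
  • original URL: http://www.mms.gov/awards/Assets/photos/verret.jpg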
  • mimetype: image/jpeg
  • HTTP status code: 200
  • checksum: 3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX
  • redirect: - (this isn't a 3XX redirect response, so there is no value in the example)
  • (W)ARC file offset: 930684 (byte offset of the record within the file)
  • name of file containing record: UNT-20100526134208-00298-libharvest1.warc.gz
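
As a rough illustration, a line like the sample above could be split into named fields as follows. This is only a sketch: it assumes every line has exactly nine whitespace-separated fields in the order of the sample line (with the original URL in third position), and the names CdxRecord and parse_cdx_line are hypothetical, not part of any EOTCD tooling.

```python
from typing import NamedTuple, Optional


class CdxRecord(NamedTuple):
    """One CDX line, split into fields (hypothetical helper type)."""
    canonical_url: str
    date: str
    original_url: str
    mimetype: str
    status: str              # kept as text; not assumed to always be numeric
    checksum: str
    redirect: Optional[str]  # None when the field is "-"
    offset: int
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one CDX line on whitespace; assumes the nine-field layout of the sample."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    canonical_url, date, original_url, mimetype, status, checksum, redirect, offset, filename = parts
    return CdxRecord(
        canonical_url=canonical_url,
        date=date,
        original_url=original_url,
        mimetype=mimetype,
        status=status,
        checksum=checksum,
        redirect=None if redirect == "-" else redirect,
        offset=int(offset),
        filename=filename,
    )


sample = (
    "mms.gov/awards/assets/photos/verret.jpg 20100526135234 "
    "http://www.mms.gov/awards/Assets/photos/verret.jpg image/jpeg 200 "
    "3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX - 930684 "
    "UNT-20100526134208-00298-libharvest1.warc.gz"
)
record = parse_cdx_line(sample)
print(record.mimetype, record.status, record.offset)
```

The offset and file name together identify where the captured record is stored, so a parsed record is enough to locate the content in the collection.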


Data Across The Collection

  • CDX files processed: 125713
  • total URIs processed: 160211356
  • invalid URIs processed: 785
  • total characters in URIs: 16411632088
  • unique subdomains processed: 141076
  • unique domains processed: 226
  • other available statistics:
    • domain distribution across collection
    • mimetype distribution across collection
    • HTTP status code distribution across collection
    • number of links between two second-level subdomains
    • number of links between two full-level subdomains
    • number of URLs per subdomain
    • mimetype distribution per subdomain
    • HTTP status code distribution per subdomain
    • sizes in bytes per mimetype at subdomain level
    • bytes per relevant .gov and .mil second-level subdomain
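
Several of the distributions listed above could in principle be tallied directly from CDX lines. The sketch below is not the procedure used to produce the collection numbers. It assumes nine whitespace-separated fields per line (as in the sample line) and treats everything before the first "/" of the canonicalized URL as the host. The function name tally_cdx and the second input line are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_cdx(lines: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Count mimetypes, HTTP status codes, and URLs per host across CDX lines."""
    mimetypes: Counter = Counter()
    statuses: Counter = Counter()
    hosts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip headers and malformed lines (assumption: 9 fields per record)
        canonical_url, _date, _original_url, mimetype, status = fields[:5]
        host = canonical_url.split("/", 1)[0]  # assumption: host precedes the first "/"
        mimetypes[mimetype] += 1
        statuses[status] += 1
        hosts[host] += 1
    return mimetypes, statuses, hosts


example_lines = [
    # sample line from this page
    "mms.gov/awards/assets/photos/verret.jpg 20100526135234 "
    "http://www.mms.gov/awards/Assets/photos/verret.jpg image/jpeg 200 "
    "3DSDSJDXOPUD7AFG2I6BA6Y5R2VXONRX - 930684 "
    "UNT-20100526134208-00298-libharvest1.warc.gz",
    # invented line, for illustration only
    "example.gov/index.html 20100101000000 http://example.gov/index.html "
    "text/html 404 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - 1024 EXAMPLE-00000.warc.gz",
]
mimetype_counts, status_counts, host_counts = tally_cdx(example_lines)
print(mimetype_counts.most_common())
print(status_counts.most_common())
print(host_counts.most_common())
```

Per-subdomain breakdowns would follow the same pattern, with counters keyed on (host, mimetype) or (host, status) pairs instead of single values.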