Webcontent.gov - Your Guide to Managing U.S. Government Websites

Search Engine Indexing and Robots.txt Files

What is It?

Before indexing a site, search engine robots check a special plain text file called robots.txt in the root of each server. Robots.txt implements the Robots Exclusion Protocol, which allows you, as a web manager, to define which parts of your site are off-limits to search engine crawlers. For example, web managers can disallow access to Common Gateway Interface (CGI) directories, or to private and temporary directories, because they don’t want pages in those areas indexed.

Here is some general information about robots.txt files.

Robots.txt File

The robots exclusion standard, or robots.txt protocol, is a convention that asks cooperating web spiders and other web robots not to access all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website. Each record in a robots.txt file has two parts: a User-agent line and one or more Disallow lines. The User-agent line names the robot the record applies to (an asterisk means all robots), and each Disallow line gives a path prefix that those robots should not crawl. The robots.txt file is a gentleman's agreement: compliance is voluntary, and some crawlers may ignore it.

USASearch.gov uses the MSN index, so to check if MSN has crawled your site lately, look for the msnbot in your web logs.
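The log check described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_crawler_hits` and the two sample log lines are invented here, and the sketch assumes your server writes the user-agent string into each log line (as the common "combined" log format does).

```python
from typing import Iterable, List


def find_crawler_hits(log_lines: Iterable[str], token: str = "msnbot") -> List[str]:
    """Return the log lines that mention the crawler token (case-insensitive)."""
    token = token.lower()
    return [line.rstrip("\n") for line in log_lines if token in line.lower()]


# Invented sample lines in a combined-style log format, for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Jul/2008:10:00:00 -0400] "GET /robots.txt HTTP/1.1" 200 68 "-" "msnbot/1.1"',
    '198.51.100.7 - - [01/Jul/2008:10:00:05 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for hit in find_crawler_hits(sample_log):
    print(hit)
```

To scan a real log, pass an open file object instead of the sample list; the location of that file depends on your server configuration.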

Example of a recommended robots.txt file that blocks crawling of the cgi-bin, scripts, and images directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
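
One way to confirm how a cooperating robot would read these rules is Python's standard `urllib.robotparser` module. The sketch below feeds it the example file and asks about two URLs; the host `www.example.gov` and both paths are placeholders, not real addresses.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs: one inside a disallowed directory, one outside.
blocked = parser.can_fetch("msnbot", "http://www.example.gov/cgi-bin/form.cgi")
allowed = parser.can_fetch("msnbot", "http://www.example.gov/index.html")

print(blocked)  # False: /cgi-bin/ is disallowed for all robots
print(allowed)  # True: no rule covers /index.html
```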

Why It’s Important

The USASearch.gov team has received several emails from federal webmasters informing us that their websites are not being crawled by USA.gov (formerly FirstGov.gov). We have found that most of these sites have disallowed search bots from crawling them. For your content to be included in USASearch.gov, or any search engine, you must allow search engines to crawl your site. USASearch.gov uses the MSN Search index to provide its core results. At the very least, federal webmasters should allow MSNBot to crawl their sites so they can be included in search results for USA.gov, the official web portal of the U.S. government.

In addition, OMB Memorandum M-06-02, "Improving Public Access to and Dissemination of Government Information and Using the Federal Enterprise Architecture Data Reference Model," says: "when disseminating information to the public-at-large, publish your information directly to the Internet. This procedure exposes information to freely available and other search functions and adequately organizes and categorizes your information."

This memorandum assumes that your robots.txt file is allowing search engines to crawl your site. If you are disallowing search engine crawlers, you are not exposing information to search engines, and therefore not complying with this guidance.
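
A quick self-check for this situation, sketched with Python's standard `urllib.robotparser`: it tests whether a given robots.txt lets a named crawler fetch the site root. The helper name `crawler_allowed` and the three sample files are illustrative assumptions, not files taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_text: str, user_agent: str = "msnbot", path: str = "/") -> bool:
    """Report whether robots_text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, path)


# Illustrative files: one blocks every robot, one blocks none,
# and one admits only msnbot while blocking everything else.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
MSNBOT_ONLY = "User-agent: msnbot\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

print(crawler_allowed(BLOCK_ALL))    # False: the site is closed to all crawlers
print(crawler_allowed(ALLOW_ALL))    # True
print(crawler_allowed(MSNBOT_ONLY))  # True for msnbot
print(crawler_allowed(MSNBOT_ONLY, user_agent="otherbot"))  # False
```

If the first case describes your site, search engines that honor robots.txt will not crawl it.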

Best Practices

  • Include the robots.txt file in your server's root directory. This is standard web management practice. Crawlers that follow the Robots Exclusion Protocol request robots.txt only from the root, so copies in subdirectories do not reliably control crawling and make your site's rules harder to audit.
  • Search your server for stray robots.txt files and delete any robots.txt file below the root directory.
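
The search for stray files can be sketched as a short Python script. The function name `find_stray_robots` and the demonstration directory layout are hypothetical; the sketch only lists files and leaves deletion to you.

```python
import tempfile
from pathlib import Path
from typing import List


def find_stray_robots(web_root: str) -> List[str]:
    """Return paths (relative to web_root) of robots.txt files below the root."""
    root = Path(web_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("robots.txt")
        if path.parent != root  # the root-level file is the legitimate one
    )


# Demonstration on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "robots.txt").write_text("User-agent: *\nDisallow: /cgi-bin/\n")
    (root / "archive" / "old").mkdir(parents=True)
    (root / "archive" / "robots.txt").write_text("User-agent: *\nDisallow: /\n")
    (root / "archive" / "old" / "robots.txt").write_text("User-agent: *\nDisallow: /\n")
    strays = find_stray_robots(tmp)

print(strays)  # ['archive/old/robots.txt', 'archive/robots.txt']
```

On a real server, call `find_stray_robots` with the path to your document root and review the list before removing anything.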

Meta-Tag Robots Exclusion

Review your pages to make sure you are not using robots exclusion in your meta tags if you intend for those pages to be publicly disseminated. For those not familiar with meta-tag robots exclusion: an HTML meta tag in a page's head can tell robots not to index that page or not to follow its links. Again, this is purely advisory and relies on the cooperation of the robot programs.

Example of a meta-tag robots exclusion:

<head><meta name="robots" content="noindex, nofollow"></head>
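
To review pages for this tag, a minimal sketch using Python's standard `html.parser` module is shown here. The class `RobotsMetaFinder` and the function `robots_directives` are names made up for this example, and the sketch assumes you already have each page's HTML as a string.

```python
from html.parser import HTMLParser
from typing import List


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> List[str]:
    """Return the robots meta directives found in html, lowercased."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.directives


page = '<head><meta name="robots" content="noindex, nofollow"></head>'
found = robots_directives(page)
print(found)  # ['noindex', 'nofollow']
```

A page meant for public dissemination should not report `noindex` here.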

Page Updated or Reviewed: July 1, 2008
