Sitemaps

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

Sitemaps are particularly beneficial on websites where:

  • some areas of the website are not available through the browsable interface, or
  • webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines.

The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, Bing, Yahoo, and Ask use the same protocol now[citation needed], having a Sitemap would let the biggest search engines have the updated pages information.

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results.[citation needed]

Contents

[edit] History

Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.

In April 2007, Ask.com and IBM announced support for Sitemaps. Also, Google, Yahoo, MS announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.

The Sitemaps protocol is based on ideas[1] from "Crawler-friendly Web Servers".[2]

[edit] File format

The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz format.

A sample Sitemap that contains just one URL and uses all optional tags is shown below.

<?xml version="1.0" encoding="utf-8"?>
 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>

[edit] Element definitions

The definitions for the elements are shown below[3]:

Element Required? Description
<urlset> Yes The document-level element for the Sitemap. The rest of the document after the '<?xml version>' element must be contained in this.
<url> Yes Parent element for each entry. The remaining elements are children of this.
<loc> Yes Provides the full URL of the page, including the protocol (e.g. http, https) and a trailing slash, if required by the site's hosting server. This value must be less than 2,048 characters.
<lastmod> No The date that the file was last modified, in ISO 8601 format. This can display the full date and time or, if desired, may simply be the date in the format YYYY-MM-DD.
<changefreq> No How frequently the page may change:
  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

'Always' is used to denote documents that change each time that they are accessed. 'Never' is used to denote archived URLs (i.e. files that will not be changed again).

This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed.

<priority> No The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important.

The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5.

Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages in the site are to one another.

Support for the elements that are not required can vary from one search engine to another.[3]

[edit] Sitemap index

The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 10 MB or 50,000 URLs means this is necessary for large sites. As the Sitemap needs to be in the same directory as the URLs listed, Sitemap indexes are also useful for websites with multiple subdomains, allowing the Sitemaps of each subdomain to be indexed using the Sitemap index file and robots.txt.

[edit] Other formats

[edit] Text file

The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well; the file must be UTF-8 encoded, and cannot be more than 10 MB large or contain more than 50,000 URLs, but can be compressed as a gzip file.[3]

[edit] Syndication feed

A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is this method might only provide crawlers with more recently created URLs, but other URLs can still be discovered during normal crawling.[3]

[edit] Search engine submission

If Sitemaps are submitted directly to a search engine (pinged), it will return status information and any processing errors. The details involved with submission will vary with the different search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line to robots.txt:

Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the sitemap, such as: http://www.example.org/sitemap.xml (however, see the discussion). This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, this URL can simply point to the main sitemap index file.

The following table lists the sitemap submission URLs for several major search engines:

Search engine Submission URL Help page
Google http://www.google.com/webmasters/tools/ping?sitemap= Submitting a Sitemap
Yahoo! http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=SitemapWriter&url= Does Yahoo! support Sitemaps?
Ask.com http://submissions.ask.com/ping?sitemap= Q: Does Ask.com support sitemaps?
Bing (Live Search) http://www.bing.com/webmaster/ping.aspx?siteMap= Bing Webmaster Tools
Yandex  — Sitemaps files
Didikle (Turkish Mobile Search Engine) http://www.didikle.com/ping?sitemap= Sitemap and robots.txt info (in Turkish)

Sitemap URLs submitted using the sitemap submission URLs need to be URL-encoded, replacing : with %3A, / with %2F, etc.[3]

[edit] Sitemap limits

Sitemap files have a limit of 50,000 URLs and 10 megabytes per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. Sitemap index files may not list more than 50,000 Sitemaps and must be no larger than 10MB (10,485,760 bytes) and can be compressed. You can have more than one Sitemap index file.[3]

As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).

[edit] See also

[edit] References

  1. ^ M.L. Nelson, J.A. Smith, del Campo, H. Van de Sompel, X. Liu (2006). "Efficient, Automated Web Resource Harvesting". WIDM'06. http://public.lanl.gov/herbertv/papers/f140-nelson.pdf. 
  2. ^ O. Brandman, J. Cho, Hector Garcia-Molina, and Narayanan Shivakumar (2000). "Crawler-friendly web servers". Proceedings of ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 2. doi:10.1145/362883.362894. 
  3. ^ a b c d e f Official protocol for XML Sitemaps

[edit] External links

Personal tools
Namespaces
Variants
Actions
Navigation
Interaction
Toolbox
Print/export
Languages