Friday 24 January 2014

Robots.txt

Robots.txt is a simple text file whose directives specify which pages on a website search engine bots may or may not crawl. The file must be placed in the root directory of your site.
It helps keep web pages and content from being crawled, indexed, and shown in search results.
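
For example, a site at http://site.com would serve the file at http://site.com/robots.txt. A minimal sketch of such a file might look like this (the /private/ folder is only a placeholder):

User-agent: *
Disallow: /private/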

If you list certain pages or files in a robots.txt file but external resources link to those URLs, search engine bots may still crawl and index them and show the pages in search results.
Also, not all robots follow the instructions in a robots.txt file, so some bots may crawl and index pages it lists. If you want an extra indexing block, a robots meta tag with 'noindex' and 'nofollow' values in the content attribute will serve as such when added to those specific web pages, as shown below:
  <meta name="robots" content="noindex,nofollow">
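
The tag belongs in the page's <head> section. A minimal sketch of a hypothetical page that blocks indexing and link-following:

<html>
<head>
  <title>Private page</title>
  <meta name="robots" content="noindex,nofollow">
</head>
<body>...</body>
</html>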

This file is useful for the following purposes:
  • Protecting private content and discouraging crawlers from visiting private folders.
  • Helping prevent duplicate content from being indexed.
  • Blocking all robots from the site, or allowing only specific bots to crawl it, which saves bandwidth.
  • Keeping robots away from less noteworthy content on a website, which matters for on-page SEO.
  • Providing bots with the location of your Sitemap. To do this, add a directive to your robots.txt that points to your Sitemap:

Example: Sitemap: http://site.com/sitemaplocation.xml

There are two major elements in a robots.txt file:
(A) User-agent
(B) Disallow

User-agent: Specifies which bot the rules that follow apply to. It is most often set to the wildcard (*), an asterisk signifying that the blocking instructions are for all bots; it can also name a specific crawler, such as Googlebot.

Disallow: When nothing is specified after Disallow, bots may crawl all the pages on the site. To block a certain page, use exactly one URL prefix per Disallow line; you cannot include multiple folders or URL prefixes in a single Disallow line. To block several locations, repeat the directive, as shown below.
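
For instance, a sketch blocking two folders with one Disallow line per prefix (the folder names are only placeholders):

User-agent: *
Disallow: /folder1/
Disallow: /folder2/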

Examples:
(1) To allow all bots to access the whole site:

User-agent: *
Disallow:

(2) To block all bots from the entire site:

User-agent: *
Disallow: /

(3) To allow Google's robot (Googlebot) and disallow all other robots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

(4) To block the site from a single robot:

User-agent: Name of bot
Disallow: /

(5) To block some parts of the site:

User-agent: *
Disallow: /directory name/

(6) To block bots from a specific file:

User-agent: *
Disallow: /directory/file.html

(7) To block URLs containing specific query strings. In this case, any URL containing a question mark (?) is blocked:

User-agent: *
Disallow: /*?

(8) When you use a forward slash after a directory or folder name, robots.txt blocks that directory or folder and everything in it, as shown below:

User-agent: *
Disallow: /directory name/
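
By contrast, a Disallow rule is a plain prefix match, so omitting the trailing slash blocks every URL beginning with that prefix. In this sketch (again with a placeholder name), /directory, /directory.html, and /directory/page.html would all be blocked:

User-agent: *
Disallow: /directory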

Notes:

  • Search engine bots request the robots.txt file by default. If the file is not found, the request returns a 404 error. To avoid this, you should at least use a default robots.txt, i.e. a blank robots.txt file.


  • Another important point is that you must not include any URL that is blocked in your robots.txt file in your XML sitemap.