Robots.txt is a simple text file containing the directives that specify which pages on a website search engine bots may or may not crawl. The file must be placed in the root directory of your site. It is used to keep web pages and other content from being crawled and, in most cases, from being indexed and shown in search results.
If you list certain pages or files in a robots.txt file but links to those pages exist in external resources, search engine bots may still crawl and index those URLs and show the pages in search results. Also, not all robots follow the instructions given in robots.txt files, so some bots may crawl and index pages listed in it. If you want an extra indexing block, a robots meta tag with the 'noindex' and 'nofollow' values in its content attribute will serve as such when placed on those specific web pages, as shown below:
<meta name="robots" content="noindex,nofollow">
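If you want to verify that a page actually carries this tag, below is a minimal sketch using Python's standard-library html.parser; the sample HTML is illustrative only.

from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    # Collects the content attribute of every <meta name="robots"> tag.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

finder = RobotsMetaFinder()
finder.feed('<head><meta name="robots" content="noindex,nofollow"></head>')
print(finder.directives)  # ['noindex,nofollow']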
This file is useful for the following purposes:
- Protecting private content from compliant crawlers.
- Helping prevent duplicate content from being indexed.
- Blocking all compliant robots from a site.
- Supporting on-page SEO.
- Discouraging crawlers from visiting private folders.
- Keeping robots from crawling less noteworthy content on a website.
- Allowing only specific bots to crawl your site, which saves bandwidth.
- Providing bots with the location of your sitemap. To do this, add a Sitemap directive to your robots.txt that points to the sitemap's URL, as in the example below.
Example: Sitemap: http://site.com/sitemaplocation.xml
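If you want to read that directive programmatically, below is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content and the site.com URL are placeholders, and site_maps() requires Python 3.8 or newer.

from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content with a Sitemap directive.
robots_txt = """\
User-agent: *
Disallow:
Sitemap: http://site.com/sitemaplocation.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) returns the Sitemap URLs, or None if there are none.
print(parser.site_maps())  # ['http://site.com/sitemaplocation.xml']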
There are two major elements in a robots.txt file:
(A) User-agent
(B) Disallow
User-agent: The user-agent is most often represented with a wildcard (*), an asterisk signifying that the blocking instructions apply to all bots. To target a single crawler, name it instead (for example, User-agent: Googlebot).
Disallow: When Disallow has nothing specified, bots can crawl all the pages on a site. To block a certain page, use only one URL prefix per Disallow line; you cannot list multiple folders or URL prefixes on a single Disallow line.
Examples:
(1) To allow all bots to access the whole site:
User-agent: *
Disallow:
(2) To block the entire site from all bots:
User-agent: *
Disallow: /
(3) To allow Google's robot and disallow all other robots:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
(4) To block the site from a single robot (BotName is a placeholder for that bot's user-agent token):
User-agent: BotName
Disallow: /
(5) To block some parts of the site:
User-agent: *
Disallow: /directory-name/
(6) To block bots from a specific file:
User-agent: *
Disallow: /directory/file.html
(7) To block URLs containing specific query strings. In this case, any URL containing a question mark (?) is blocked:
User-agent: *
Disallow: /*?
(8) When you use a forward slash after a directory or folder name, robots.txt blocks that directory or folder and everything in it, as shown below:
Disallow: /directory-name/
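To see how a compliant crawler interprets rules like those in example (3), below is a minimal sketch using Python's standard-library urllib.robotparser; the site.com URLs are placeholders. Note that this parser follows the original robots.txt draft and may not understand wildcard patterns such as the /*? rule in example (7), which is a later extension.

from urllib.robotparser import RobotFileParser

# Placeholder rules matching example (3).
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # feed the rules directly instead of fetching them

# Googlebot matches the first record, whose empty Disallow allows everything.
print(parser.can_fetch("Googlebot", "http://site.com/page.html"))  # True
# Any other bot falls through to the catch-all record, which blocks the whole site.
print(parser.can_fetch("SomeOtherBot", "http://site.com/page.html"))  # False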
Notes:
- Search engine bots request the robots.txt file by default. If the file is not found, the server returns a 404 error, which the bots log. To avoid this, serve at least a default robots.txt, i.e. a blank robots.txt file.
- Another important point is that you must not include any URL that is blocked in your robots.txt file in your XML sitemap.
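As a quick check of the first note, the sketch below requests robots.txt with Python's standard-library urllib and reports the status code; http://site.com/robots.txt is a placeholder URL.

import urllib.request
from urllib.error import HTTPError

try:
    # Placeholder URL; replace with your own site's robots.txt.
    with urllib.request.urlopen("http://site.com/robots.txt") as response:
        print(response.status)  # 200 means the file is being served
except HTTPError as err:
    print(err.code)  # 404 means crawlers will find no robots.txt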