

How to tell search engines not to crawl your entire website?


There are two main reasons to tell search engine robots not to crawl certain directories and files on your website.

First, directories that store your backend programs, associated class files and JavaScript files serve no purpose for search engines, so there is no reason to expose them.

Second, when a search engine robot crawls your website, drilling down into every directory, sub-directory and file, it consumes bandwidth. Bandwidth is money, because you pay your hosting provider for the bandwidth you consume. Naturally, you would like that bandwidth to be available more to your site visitors and less to search engine robots.

Hence, the need to tell search engines what not to crawl.

Fortunately, the web has the Robots Exclusion Protocol (REP) for communicating this to search engines. When an REP-compliant robot (most search engine robots are) visits your website, it first looks for a robots.txt file in your website's home (root) directory. If the file exists, the robot reads the instructions in it and decides accordingly what not to crawl. This file is therefore where you tell search engines not to crawl certain sections of your website's directory and file structure.
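
The check a compliant robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how any particular search engine is implemented: the robots.txt content, the /cgi-bin/ directory, the ExampleBot name and the helper may_crawl are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a robot might receive it from
# http://www.yoursitename.com/robots.txt (the directory name is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules already held in memory
    return parser.can_fetch(user_agent, url)


blocked = may_crawl(ROBOTS_TXT, "ExampleBot",
                    "http://www.yoursitename.com/cgi-bin/form.php")
allowed = may_crawl(ROBOTS_TXT, "ExampleBot",
                    "http://www.yoursitename.com/index.html")
print(blocked, allowed)  # False True
```

A real robot would download the file first (RobotFileParser offers set_url() and read() for that); parsing an in-memory string keeps the sketch runnable without network access.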


Points to Note:

  1. The robots.txt file is a single file that must reside in your website's home directory so that it can be fetched with the URL http://www.yoursitename.com/robots.txt
  2. The file name must be in lowercase only.
  3. If you have a situation where different webmasters work on the same domain under different directories, you will need to co-ordinate with them to obtain a list of directories that are not to be crawled, and put the suitable Disallow instructions in your single robots.txt file. You cannot have multiple such files existing in different sub-directories, as the search engine will only read the one which is under your home directory.
  4. Use the Allow field with care. It was not part of the original protocol, so older robots may ignore it. Major search engines such as Google and Bing do support it, and it is included in the standardized protocol (RFC 9309).
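
To illustrate point 1: whatever page a robot starts from, the robots.txt location depends only on the scheme and host of the site. A minimal sketch, where the helper name robots_txt_url and the sample page URL are assumptions:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt,
    # and any query string or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.yoursitename.com/docs/manual/page.html?x=1")
print(location)  # http://www.yoursitename.com/robots.txt
```

This is also why a robots.txt file placed in a sub-directory is never consulted: no robot derives a URL such as /docs/robots.txt.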

Format of the robots.txt file

Below is a sample robots.txt file.

[Image: typical robots.txt file format]

In the above example, all files in four directories, and additionally one specific file, are excluded. Note that you need a separate Disallow line for every directory or specific file that you want excluded. Also, the original protocol does not let you use regular expressions or wildcards to specify multiple paths, so a robot that follows only the original protocol will not interpret an entry like Disallow: /doc/*.pvt as a pattern. Major search engines such as Google and Bing, and the standardized protocol (RFC 9309), do recognize * and $ in paths, but you should not assume that every robot does. In the original protocol, the only place where the * character has a special meaning is the User-agent field, where it indicates all robots.

When a search engine robot finds a robots.txt file as in the above example, it will crawl through all directories and files except the ones listed in the Disallow fields.
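
A file of this shape can be checked with Python's standard urllib.robotparser module. The directory and file names below are hypothetical stand-ins, not the entries from the original sample, and ExampleBot is an invented robot name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: four directories and one specific file are excluded.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /classes/
Disallow: /js/
Disallow: /admin/
Disallow: /docs/internal.html
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

BASE = "http://www.yoursitename.com"
paths = ["/index.html", "/js/menu.js", "/docs/internal.html", "/docs/guide.html"]

# Each Disallow value is matched as a prefix of the requested path.
results = {path: parser.can_fetch("ExampleBot", BASE + path) for path in paths}
for path, ok in results.items():
    print(path, "crawl" if ok else "skip")
```

Here /js/menu.js and /docs/internal.html are skipped, while /index.html and /docs/guide.html remain crawlable, because only the listed prefixes are excluded.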


A point of caution: The robots.txt file is publicly viewable, thereby exposing the internal structure of your website directories and files. Hence, you should organize your directories and files in such a manner that you do not have to expose too much. See below for a suggested structure.

[Image: typical website directory structure]


Some more Examples

If you do not wish to exercise any restrictions for search engine robots and allow access to your entire website, you can do one of the following:

  1. Do not have a robots.txt file.
  2. Have an empty robots.txt file.
  3. Have a robots.txt file with the following entry:

    User-agent: *
    Disallow:
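
As a quick check, options 2 and 3 can be compared with Python's standard urllib.robotparser module. The helper can_fetch, the ExampleBot name and the page URL are assumptions for this sketch; option 1 (no file at all) needs a real network fetch, so it is not shown.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt from memory and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


EMPTY = ""                                   # option 2: empty file
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # option 3: empty Disallow value

url = "http://www.yoursitename.com/any/page.html"
empty_ok = can_fetch(EMPTY, "ExampleBot", url)
explicit_ok = can_fetch(ALLOW_ALL, "ExampleBot", url)
print(empty_ok, explicit_ok)  # True True
```

Both variants leave the whole site open, so the choice between them is a matter of preference; the explicit entry simply documents the intent.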


Disallow all search engines completely

User-agent: *
Disallow: /


Disallow all search engines except Google

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
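
The effect of these two records can be verified with Python's standard urllib.robotparser module. ExampleBot is an invented robot name standing in for any robot other than Googlebot, and the page URL is an assumption.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.yoursitename.com/index.html"
# The record naming Googlebot takes precedence over the catch-all * record.
google_ok = parser.can_fetch("Googlebot", url)
other_ok = parser.can_fetch("ExampleBot", url)
print(google_ok, other_ok)  # True False
```

A robot uses the record that names it specifically and falls back to the * record only when no such record exists.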



Submitting a Sitemap using robots.txt

In the first image above, you may have noticed a sitemap entry: Sitemap: http://www.how2lab.com/sitemap.xml. This entry tells search engines that you have an XML sitemap prepared especially for them, and where it is located. (Read more about How to build a search engine friendly sitemap).

Note that this directive is independent of the User-agent line, so it does not matter where you place it in your robots.txt file.
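
The position independence can be demonstrated with Python's standard urllib.robotparser module, whose site_maps() method is available from Python 3.8. The /cgi-bin/ rule and the helper name sitemaps are assumptions added for this sketch; only the Sitemap line comes from the article.

```python
from urllib.robotparser import RobotFileParser

SITEMAP_LINE = "Sitemap: http://www.how2lab.com/sitemap.xml"

# The same hypothetical rules, with the Sitemap entry first and last.
TOP = SITEMAP_LINE + "\nUser-agent: *\nDisallow: /cgi-bin/\n"
BOTTOM = "User-agent: *\nDisallow: /cgi-bin/\n" + SITEMAP_LINE + "\n"


def sitemaps(robots_txt: str):
    """Return the sitemap URLs listed in robots_txt (None if there are none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps()


top_result = sitemaps(TOP)
bottom_result = sitemaps(BOTTOM)
print(top_result == bottom_result, top_result)
# True ['http://www.how2lab.com/sitemap.xml']
```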


About the Author
Rajeev Kumar
CEO, Computer Solutions
Jamshedpur, India

Rajeev Kumar is the primary author of How2Lab. He holds a B.Tech. from IIT Kanpur and has several years of experience in IT education and software development. He has taught a wide spectrum of people, including fresh young talents, students of premier engineering colleges & management institutes, and IT professionals.

Rajeev founded Computer Solutions & Web Services Worldwide. He has hands-on experience building a variety of websites and business applications, including SaaS-based ERP & e-commerce systems, and cloud-deployed operations management software for health-care, manufacturing and other industries.


Copyright © How2Lab.com. All rights reserved.