There are two main reasons why you need to tell search engine robots not to crawl certain sections of your website's directory and file structure.
Directories that store your backend programs, associated class files and javascript files serve no purpose for search engines, so there is no reason to expose them.
Besides, when a search engine robot crawls your website, drilling down into every directory, sub-directory and file, it consumes bandwidth. Bandwidth is money, because you pay your hosting provider for the bandwidth you consume. Naturally, you would like that precious bandwidth to be available more for your site visitors and less for search engine robots.
Hence, the need to tell search engines what not to crawl.
Fortunately, the world wide web provides the Robots Exclusion Protocol (REP) for communicating this to search engines. When a REP-compliant robot (most search engine robots are) visits your website, it first looks for a robots.txt file in your website's home directory. If this file exists, it reads through all the instructions contained in the file and accordingly decides what not to crawl. So it is in this file that you tell search engines not to crawl certain sections of your website's directory and file structure.
Below is a sample robots.txt file.
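A representative sample is shown below. The directory and file names here are hypothetical, chosen only to illustrate the syntax; substitute the actual paths of your own website:

Sitemap: http://www.how2lab.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /classes/
Disallow: /scripts/
Disallow: /tmp/
Disallow: /doc/internal-notes.html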
In the above example, all files in four directories, and additionally one specific file, are excluded. Note that you need to place a separate Disallow line for every directory or specific file that you want excluded. Also, you cannot use regular expressions to specify multiple entries. For instance, an entry like Disallow: /doc/*.pvt would be invalid under the original REP (though some major search engines now support limited wildcard patterns as an extension). In the standard itself, the only place where a * character is allowed is in the User-agent field, where the * indicates all robots.
When a search engine robot finds a robots.txt file such as the above example, it will crawl through all directories and files except the ones listed against Disallow fields.
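You can check this behaviour yourself with Python's standard urllib.robotparser module, which implements the same exclusion logic a compliant robot uses. The rules and paths below are hypothetical, mirroring the style of the sample above:

```python
from urllib import robotparser

# Hypothetical robots.txt content; the directory and file
# names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /doc/internal-notes.html
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant robot checks each URL against the rules before fetching it.
print(rp.can_fetch("*", "/tmp/cache.dat"))      # False: under a disallowed directory
print(rp.can_fetch("*", "/articles/seo.html"))  # True: not excluded
```

This is the same check a REP-compliant crawler performs for every URL it considers fetching.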
A point of caution: The robots.txt file is publicly viewable, thereby exposing the internal structure of your website directories and files. Hence, you should organize your directories and files in such a manner that you do not have to expose too much. See below for a suggested structure.
If you do not wish to exercise any restrictions for search engine robots, you can use one of the following, depending on the scenario.

Allow all search engines to crawl your entire website:

User-agent: *
Disallow:

Disallow all search engines completely:

User-agent: *
Disallow: /

Disallow all search engines completely, and only allow Google to crawl your website:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
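The per-agent matching in the last scenario can also be verified with Python's standard urllib.robotparser module. The robot names and URL below are illustrative:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Disallow:",
])

# Googlebot matches its own record (an empty Disallow allows everything),
# while every other robot falls back to the catch-all record.
print(rp.can_fetch("Googlebot", "/index.html"))     # True
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False
```

Note that a robot looks only at the most specific record that matches its own name; the catch-all * record applies only to robots that have no record of their own.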
You would have noticed in the sample above that there is a sitemap entry - Sitemap: http://www.how2lab.com/sitemap.xml. This entry lets you tell search engines that you have an xml sitemap prepared especially for them, and where it is located. (Read more about How to build a search engine friendly sitemap.)
Note that this directive is independent of the User-agent line, hence it does not matter where you place it in your robots.txt file.
Rajeev Kumar is the primary author of How2Lab. He is a B.Tech. from IIT Kanpur with several years of experience in IT education and Software development. He has taught a wide spectrum of people including fresh young talents, students of premier engineering colleges & management institutes, and IT professionals.
Rajeev has founded Computer Solutions & Web Services Worldwide. He has hands-on experience in building a variety of websites and business applications, including SaaS-based ERP and e-commerce systems, and cloud-deployed operations management software for health-care, manufacturing and other industries.