It is very common to publish Web pages, images, and other files on your Web site that you do not want everyone to see. By using robots.txt, you can make broad declarations about which files robots may browse (and users may therefore find through search engines) and which files robots should skip. Skipped files are not protected, though: anyone who knows exactly where they are can still open them.
As defined on the official robots.txt site, “a robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.” For us humans, a robot is simply an automated “thing” that views your Web pages and follows any files referenced, such as HTML files, JavaScript files, CSS files, images, and more.
A robots.txt file is a plain text file containing simple declarations that a robot reads upon visiting your site. The process is similar to asking your parents if you could go out when you were younger. The robot (child) visits the Web site (parent) and asks what it has permission to do. If there’s no robots.txt file, the robot (just as many children would) goes wherever it wants. If there is, a well-behaved robot reads the file and abides by the rules. Nothing forces it to, so badly behaved robots can simply ignore the file.
Once created, the robots.txt file should be uploaded to the root directory of your Web site. When search engines like Google and Yahoo! visit your site, they will automatically look for the file.
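Because the file always sits in the root directory, a robot can work out its address from any page on your site. Here is a minimal Python sketch of that step; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/photos/2005/index.html"))
# prints http://www.example.com/robots.txt
```

A robots.txt file placed in a subfolder (for example /photos/robots.txt) would not be found this way, which is why the root directory matters.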
If you do not upload a robots.txt file, the robot will assume that there are no restrictions for “crawling” your files and go about its business. Depending on your personal level of nerdiness, you might see 404 (page not found) errors in your server logs from robots requesting the robots.txt file.
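To see the "no restrictions" behavior for yourself, you can hand an empty set of rules to Python's standard urllib.robotparser module. This is only a sketch: "ExampleBot" and the example.com address are invented names, and real robots may handle a missing file in their own way.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Feed the parser no rules at all, as if the site had no robots.txt.
parser.parse([])

# With nothing disallowed, every address is fair game.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/images/logo.png")
print(allowed)
```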
To create your robots.txt file, open a plain text editor like Notepad. The file will consist of one or several declarations. Each declaration starts with the text “User-agent:” followed by the name of the robot it applies to. Underneath this are the rules for that robot, such as “Disallow:” lines naming the files and folders it should not crawl.
A sample robots.txt file looks something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
This robots.txt file would prevent any search engine spider from browsing the “cgi-bin” and “images” folders of your Web site. In a robots.txt file, “*” stands for “any user agent.” If you know the name of the user agent you want to disallow, you can do so by specifying its name:
User-agent: Googlebot
Disallow: /family-photos/
This robots.txt file would prevent Google’s robot from looking at any of the files in the “family-photos” folder. This comes in handy if you’re the type that likes to post photos of dysfunctional family events. All other search engines would be able to view the contents of that folder, since Googlebot is the only user agent disallowed.
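If you would like to check rules like these before uploading them, Python's standard urllib.robotparser module can read them and answer "may this robot fetch this address?". The sketch below tests both sample files. The helper load_rules, the robot name "ExampleBot", and the example.com addresses are assumptions made up for the demonstration, and real search engines may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The first sample: rules that apply to every robot.
all_robots = load_rules("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
""")

# The second sample: rules that apply to Googlebot only.
google_only = load_rules("""\
User-agent: Googlebot
Disallow: /family-photos/
""")

site = "http://www.example.com"
photo = site + "/family-photos/reunion.jpg"

print(all_robots.can_fetch("ExampleBot", site + "/images/logo.png"))  # False
print(all_robots.can_fetch("ExampleBot", site + "/index.html"))       # True
print(google_only.can_fetch("Googlebot", photo))                      # False
print(google_only.can_fetch("ExampleBot", photo))                     # True
```

The last two lines show the point of naming a user agent: the same photo is off limits to Googlebot but open to every other robot.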
December 9th, 2005 at 8:27 am
what a great article christopher
- i always wondered what those files were for!
many thanks
robert
December 30th, 2006 at 5:42 pm
Great blog, exactly the info I was looking for. One question, though. If I wanted to stop every search engine from indexing any part of my site, would this be the code to include in the text file?
User-agent: *
Disallow: /