Example Code:
http://----.com/robots.txt
----------- disallow all (*) after (http://----.com/)
User-agent: *
Disallow: /
----------- disallow all (*) after (http://----.com/folder/)
User-agent: *
Disallow: /folder/
-------------------------------------
http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
example:
Disallow: /help
disallows both /help.html and /help/index.html, whereas
Disallow: /help/
would disallow /help/index.html but allow /help.html.
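The prefix rule above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper name allowed() and the agent name "ExampleBot" are made up for illustration and are not part of the quoted note.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, path, agent="ExampleBot"):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

# The two variants from the note: without and with the trailing slash.
NO_SLASH = "User-agent: *\nDisallow: /help\n"
WITH_SLASH = "User-agent: *\nDisallow: /help/\n"

for label, rules in (("Disallow: /help ", NO_SLASH), ("Disallow: /help/", WITH_SLASH)):
    for path in ("/help.html", "/help/index.html"):
        verdict = "allowed" if allowed(rules, path) else "blocked"
        print(label, path, verdict)
```

Because a Disallow value is a plain prefix, the trailing slash decides whether /help.html is caught.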
------------------------------------------
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /hotnews/not4u.htm
------------------------
Allowing Googlebot
Block all bots but allow Googlebot:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
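A quick way to see which record a given bot ends up with is to feed the file to Python's urllib.robotparser. This is a sketch only: may_fetch() and the agent name "SomeOtherBot" are hypothetical, and real crawlers may pick their record slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Two records: a catch-all that blocks everything, and one for Googlebot
# with an empty Disallow (nothing is disallowed).
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

def may_fetch(agent, path):
    """Return True if `agent` may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)

print("Googlebot   ", may_fetch("Googlebot", "/page.html"))
print("SomeOtherBot", may_fetch("SomeOtherBot", "/page.html"))
```

A bot that has its own record uses it; every other bot falls back to the "*" record.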
------------------------
Block Googlebot entirely but still let Googlebot-Mobile crawl pages.
PS: the "Allow:" syntax may only work for Googlebot.
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow:
or
User-agent: Googlebot-Mobile
Disallow:
------------------------
User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
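How an Allow line can override a broader Disallow is easier to see in code. The sketch below assumes the "most specific (longest) matching prefix wins, Allow wins a tie" precedence that Google documents and that RFC 9309 later standardized; other crawlers may resolve conflicts differently. The function name is_allowed() is made up for illustration.

```python
def is_allowed(rules, path):
    """Decide access for `path` from (directive, prefix) pairs of one record.

    Assumed precedence: the longest matching prefix wins, and Allow wins
    when an Allow and a Disallow prefix have the same length.
    """
    best_len = -1
    allowed = True  # no matching rule at all means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# The record from the example above.
RULES = [("Disallow", "/folder1/"), ("Allow", "/folder1/myfile.html")]

for path in ("/folder1/myfile.html", "/folder1/other.html", "/index.html"):
    print(path, "allowed" if is_allowed(RULES, path) else "blocked")
```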
------------------------
User-agent: Googlebot
Disallow: /folder
------------------------
User-agent: Googlebot
Disallow: /*.gif$
------------------------
User-agent: Googlebot
Disallow: /*?
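The two wildcard patterns above use Googlebot's extensions: "*" matches any run of characters and a trailing "$" anchors the end of the URL. A minimal sketch of that matching, translated to a regular expression, could look like this. The helper names are hypothetical, and this is not Google's actual implementation.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern, path):
    """True if `path` is covered by `pattern` (matched from the start)."""
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.gif$", "/images/logo.gif"))         # URL ends in .gif
print(matches("/*.gif$", "/images/logo.gif?size=2"))  # does not end in .gif
print(matches("/*?", "/search?q=robots"))             # contains a '?'
print(matches("/*?", "/search"))                      # no '?'
```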
------ more how to block googlebot ------
http://www.google.com/support/webmasters/bin/answer.py?answer=40364&topic=8846
------ more control msnbot ------
http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm
------ more yahoo bot info -----
http://help.yahoo.com/help/us/ysearch/slurp/
---------- some standard -------
http://www.robotstxt.org/wc/exclusion-admin.html
What to put into the robots.txt file
The "/robots.txt" file usually contains a record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate "Disallow" line for every URL
prefix you want to exclude -- you cannot say
"Disallow: /cgi-bin/ /tmp/".
Also, you may not have blank lines in a record,
as they are used to delimit multiple records.
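The record structure described here (one field per line, a blank line ends the record) can be sketched as a tiny parser. This is only an illustration under those stated rules; split_records() and SAMPLE are made-up names, and a real parser handles more cases.

```python
def split_records(text):
    """Split robots.txt text into records; a blank line ends a record.

    Each record is a list of (field, value) pairs, field lowercased.
    """
    records, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:                 # blank line: close the current record
            if current:
                records.append(current)
                current = []
            continue
        if line.startswith("#"):     # skip comment-only lines
            continue
        field, _, value = line.partition(":")
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records

SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""

for record in split_records(SAMPLE):
    print(record)
```

Note that a line such as "Disallow: /cgi-bin/ /tmp/" would come out as a single value "/cgi-bin/ /tmp/", which is one prefix that matches neither directory.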
Note also that regular expressions are not supported in either
the User-agent or Disallow lines. The '*' in the User-agent
field is a special value meaning "any robot". Specifically,
you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
What you want to exclude depends on your server. Everything
not explicitly disallowed is considered fair game to retrieve.
Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field.
The easy way is to put all files to be disallowed into a separate
directory, say "docs", and leave the one file in the level above
this directory:
User-agent: *
Disallow: /~joe/docs/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
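When there are many pages to list, the second approach (one Disallow line per page) can be generated instead of typed by hand. A minimal sketch follows; the function name, the file list, and the kept page /~joe/index.html are all hypothetical.

```python
def disallow_all_except(paths, keep, agent="*"):
    """Build one robots.txt record that disallows every path except `keep`."""
    lines = ["User-agent: " + agent]
    lines += ["Disallow: " + p for p in sorted(paths) if p != keep]
    return "\n".join(lines) + "\n"

# Hypothetical listing of one user's pages; index.html is the page to keep.
PAGES = [
    "/~joe/index.html",
    "/~joe/private.html",
    "/~joe/foo.html",
    "/~joe/bar.html",
]

print(disallow_all_except(PAGES, keep="/~joe/index.html"), end="")
```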