
B2B Articles - May 27, 2011 - By Ironpaper

Using a robots.txt file to disallow a directory

Website owners and designers can provide instructions to search robots that deny or grant access to specific parts of a website. This is helpful if you have pages or sections that you do not want to appear in search results. The practice can be abused, however: some engineers build robots that specifically search for and cache the content listed in restricted areas. Sensitive areas referenced in the robots.txt file should therefore be protected by other means, because the robots.txt file is only a set of instructions, not a security mechanism. Google and other major search robots do follow the instructions in the file, but the /robots.txt file itself is publicly available to anyone.

It works like this: a robot wants to visit a website URL, say https://www.cool-website-example.com/mypage.html. Before it does so, it first checks for https://www.cool-website-example.com/robots.txt, and finds:

User-agent: *
Disallow: /

These instructions tell the bot not to crawl or index any content on the website.
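The check a well-behaved crawler performs can be sketched with Python's standard-library `urllib.robotparser`. This is a minimal illustration using the article's example domain; the rules are parsed from a string here rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# Parse the blanket-disallow rules shown above (supplied directly
# instead of fetching https://www.cool-website-example.com/robots.txt)
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

# A polite bot asks before fetching any URL on the site
allowed = parser.can_fetch("*", "https://www.cool-website-example.com/mypage.html")
print(allowed)  # False: "Disallow: /" blocks every path
```

In practice a crawler would call `parser.set_url(...)` and `parser.read()` to download the live file before checking URLs.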

Here is a more specific set of instructions:

User-agent: *
Disallow: /cgi-bin/
Disallow: /data/
Disallow: /~bob/

You need a separate Disallow line for each directory you are restricting. You cannot combine /cgi-bin/ and /data/ on a single line (like "Disallow: /cgi-bin/ /data/"); that simply will not work. Instead, follow the model above.
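The one-directory-per-line rule above can be verified the same way, again with `urllib.robotparser` and the article's illustrative domain; the paths checked here are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# Each restricted directory gets its own Disallow line
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /data/",
    "Disallow: /~bob/",
])

base = "https://www.cool-website-example.com"
print(parser.can_fetch("*", base + "/cgi-bin/script.cgi"))  # False: matches /cgi-bin/
print(parser.can_fetch("*", base + "/data/report.csv"))     # False: matches /data/
print(parser.can_fetch("*", base + "/about.html"))          # True: no rule matches
```

Each Disallow value is a path prefix, so anything under the listed directories is blocked while the rest of the site stays crawlable.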
