Understanding Robots.txt

By Matthew Toren on April 25, 2012

Unless you’ve got an IT professional on hand at all times, you, like most small business owners, will more often than not have to awkwardly don the hat of a web expert to keep your website up and running.  Even the biggest publicly traded companies, with full-time staffs of top information technology professionals, still run into one common problem: a misconfigured robots.txt file.  Becoming invisible to search engines is just about as bad as it gets on the World Wide Web.

The Basics:

What are Search Engine Spiders?

These sneaky devils are automated programs, also called web crawlers or web robots, that scan your website for content so that search engines can index it and rank it appropriately for the searcher.  Before crawling, a well-behaved spider reads your robots.txt file and retrieves only the content that the file does not block.

How Does a Robots.txt Blockage Come About?

Robots.txt restrictions are most commonly used on staging servers.  If you find yourself at the mercy of a robots.txt problem, it likely dates from when your staging site was rolled over to the live server.  Web developers use robots.txt to keep search engines from indexing a duplicate copy of your web content while the site is being built.  The trouble starts when that block is left in place after the site goes live.

How To Check Your Site for Robots.txt

You can manually check your website to rule out the possibility that it is suffering from the effects of an inappropriately placed robots.txt setting. There is no need to panic over the possibility of being blacklisted by Google; keep calm and follow these simple steps:

  • Enter your domain name followed by a forward slash and robots.txt in the address bar. For example: http://thedomainname.com/robots.txt
  • If the result is a 404 error page, your site has no robots.txt file, so robots.txt is not what is keeping the spiders away.
  • Alternatively, log into your Google Webmaster Tools page, which can tell you which URLs are restricted by a robots.txt file.
  • If your robots.txt file shows:

User-agent: *
Disallow: /

You’ll need to make changes.  Those two lines tell every spider to stay out of your entire site, so you should never see them on a live website.
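The manual check above can also be sketched in a few lines of Python. This is only an illustration, using the standard library's urllib.robotparser module; the function name blocks_everything and the two sample rule sets are made up for this example, and the domain is the same placeholder used in the steps above.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if these robots.txt rules keep user_agent off the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The domain is a placeholder; only the path ("/") is matched against the rules.
    return not parser.can_fetch(user_agent, "http://thedomainname.com/")


# The leftover staging rules shown above.
staging_rules = "User-agent: *\nDisallow: /\n"
# An empty Disallow line means nothing is blocked.
open_rules = "User-agent: *\nDisallow:\n"

print(blocks_everything(staging_rules))  # True
print(blocks_everything(open_rules))     # False
```

To test a live site instead of a pasted string, the same parser can load the file itself through its set_url() and read() methods.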

How To Prevent Parts of Your Site From Being Indexed

Robots.txt can work for you just as easily as it can hurt your website. To hide certain sections of your website from spiders or web crawlers, add rules for those sections to your robots.txt file.  For example, to keep ads or log files from being crawled, list their directories like this:

User-agent: *
Disallow: /ads
Disallow: /logs

  • Unfortunately, robots.txt isn’t a cure-all for those items you wouldn’t like searched. You may also notice the blanket effect of these rules: each Disallow line blocks every URL that begins with the path it lists. The original robots.txt protocol doesn’t allow wildcards in the Disallow line, nor does it include “Allow:” lines.  Google has since extended the basic format to support both, but these extensions are not honored by every spider, so it is recommended that they ONLY be used under a “User-agent:” line for a crawler run by Google.
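To see that blanket effect in practice, here is a small Python sketch that feeds the /ads and /logs rules to the standard library's urllib.robotparser module and asks which paths a generic spider may fetch. The helper name allowed and the sample page paths are hypothetical, and the domain is a placeholder. As far as I know, Python's parser does only simple prefix matching with no wildcard support, which makes it a fair stand-in for the basic protocol.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /ads
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic spider ("*") may fetch this path."""
    # The domain is a placeholder; only the path is matched against the rules.
    return parser.can_fetch("*", "http://thedomainname.com" + path)


# Hypothetical pages used only to illustrate the matching.
for path in ["/index.html", "/ads/banner.html", "/logs/access.log", "/adsense-tips.html"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Only /index.html comes back as allowed. The last path is blocked even though it is not inside the ads directory, because it begins with /ads. Writing the rule as "Disallow: /ads/" would limit it to that directory.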

Does the Robots.txt Prevent Users From Viewing Certain Content?

Absolutely not.  Adding rules to robots.txt only asks search engine spiders not to crawl those portions of your site.  All content remains available for the viewing pleasure of every visitor, who will be completely unaware of the robots.txt status of the page.  In all honesty, robots.txt only keeps “polite” spiders away from the information; in reality there are likely less well-mannered crawlers weaving through that data.

If you really want to protect certain data, content or sections of your website, your best bet is to password protect these areas. Also remember that if you want content officially removed from a search engine’s index, you must include a robots noindex meta tag on each and every page you want removed, and leave those pages crawlable, since a spider that is blocked by robots.txt never gets to read the tag.
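For reference, the robots noindex meta tag mentioned above is a single line in the head of a page. The Python sketch below shows one and checks a page for it using the standard library's html.parser module. The class name RobotsMetaFinder, the helper has_noindex, and the sample page are all made up for this illustration; this is a rough check, not how any search engine actually reads the tag.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # A content value such as "noindex, nofollow" holds several directives.
            self.directives.update(part.strip().lower() for part in content.split(","))


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag containing noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical page with the tag in its <head>.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Private</body></html>'
print(has_noindex(page))
```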

Understanding the simpler parts of running and maintaining your website will likely save you money, both up front and over the life of your business.  If you find that your website has disappeared from Google search or is extremely hard to find otherwise, your first step should be to double-check your robots.txt.  There is no need to spend extra money on a tech professional when you are well equipped to rule out the easy fixes and get back to the world of the living as far as the web is concerned!


Matthew Toren

About Matthew Toren

Matthew Toren is a serial entrepreneur, mentor, investor and co-founder of YoungEntrepreneur.com. He is co-author, with his brother Adam, of Kidpreneurs and Small Business, BIG Vision: Lessons on How to Dominate Your Market from Self-Made Entrepreneurs Who Did it Right (Wiley). He's based in Vancouver, B.C.