msnbot, or someone purporting to be msnbot, is ignoring my robots.txt file; it hit an expensive, database-backed page on grumet.net over a thousand times yesterday. What’s up with that?

Update: Shimon points out an error in my robots.txt file (see comments). That’s fixed now. We’ll see if it makes a difference.
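For the record, under the original robots.txt rules each Disallow line is a plain path prefix with no wildcard support, so a corrected file looks roughly like this (the path shown is a placeholder, not my actual entry):

```
User-agent: msnbot
Disallow: /cgi-bin/expensive
```

A spec-compliant crawler blocks any URL whose path starts with that literal prefix.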

One thought on “”

  1. I hear you. Remember that feed recommendations engine I wrote, the one that does a 10-15 second query for each set of recommendations? Msnbot was crawling it, or perhaps I should say running over it. The interface listed all the SYO users, so I grant you that there were probably 30,000 ways to invoke my *really* expensive DB-backed page. And it wasn’t in my robots.txt — my bad.

    But still, I think msnbot went overboard. It was hitting those pages 10+ times concurrently despite their slowness, overloading my server until it crashed. (I have taken down the recommendations engine until I figure out how to deal.) As of today, msnbot has hit my web server 55,071 times this month, accounting for 42.06% of my total hits.

    Your problem, however, is much simpler to solve. Some of the paths in your /robots.txt have a trailing *. A * is not valid at the end of a Disallow path. Remove every * from your robots.txt, except the one in the User-agent wildcard line, and you should be fine.

    http://www.robotstxt.org/wc/norobots.html
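    Here is a quick sketch of why the trailing * bites, using Python’s urllib.robotparser, which follows the original strict spec; the /cgi-bin/expensive path is just a made-up stand-in for your real page:

    ```python
    from urllib.robotparser import RobotFileParser

    # A rule with a trailing *, as in the broken robots.txt.
    broken = RobotFileParser()
    broken.parse([
        "User-agent: *",
        "Disallow: /cgi-bin/expensive*",  # trailing * -- invalid per the original spec
    ])
    # A strict parser treats the Disallow value as a literal path prefix,
    # so the stray * means the rule matches nothing:
    allowed_with_star = broken.can_fetch("msnbot", "http://grumet.net/cgi-bin/expensive")

    # The same rule with the * removed, as the spec requires.
    fixed = RobotFileParser()
    fixed.parse([
        "User-agent: *",
        "Disallow: /cgi-bin/expensive",
    ])
    allowed_without_star = fixed.can_fetch("msnbot", "http://grumet.net/cgi-bin/expensive")

    print(allowed_with_star, allowed_without_star)  # True False
    ```

    In other words, the * quietly disables the rule: the crawler is allowed through until the character is removed.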

    In related news, the White House has an interesting robots.txt. Every other line contains the word “iraq”.

    http://www.whitehouse.gov/robots.txt


Comments are closed.