Tuesday, August 31, 2010 12:50 AM
I noticed this fella hammering away at one of my sites:
195.42.102.25 - - [31/Aug/2010:02:16:10 +0200] "GET /some/url.html HTTP/1.1" 200 13427 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"
195.42.102.25 - - [31/Aug/2010:02:16:12 +0200] "GET /some/other/url.html HTTP/1.1" 200 55822 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"
The URL in the UA string appears at first glance to belong to some kind of search site, but the "category links" on the page are stuffed with the kind of keywords you associate with spam of all kinds, and no actual search results were returned for a couple of common keywords.
It now has its own entry in the Bad Bot List and some special rules at firewall level. Hasta la vista.
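For anyone wanting to spot this sort of visitor in their own logs, here is a minimal sketch, assuming the combined log format shown above. The `BAD_BOT_MARKERS` list and the function names are made up for illustration; this is not the actual Bad Bot List mechanism used here.

```python
import re
from collections import Counter

# Matches the "combined" access log format used in the entries above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list: substrings of user agents to flag.
BAD_BOT_MARKERS = ["Purebot"]


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_bad_bot_hits(lines, markers=BAD_BOT_MARKERS):
    """Count requests per (ip, marker) whose user agent contains a marker."""
    hits = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the expected format
        for marker in markers:
            if marker.lower() in entry["agent"].lower():
                hits[(entry["ip"], marker)] += 1
    return hits


sample = [
    '195.42.102.25 - - [31/Aug/2010:02:16:10 +0200] "GET /some/url.html HTTP/1.1" '
    '200 13427 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"',
    '195.42.102.25 - - [31/Aug/2010:02:16:12 +0200] "GET /some/other/url.html HTTP/1.1" '
    '200 55822 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"',
]

for key, count in count_bad_bot_hits(sample).items():
    print(key, count)
```

The addresses that turn up this way are what you would then feed into whatever blocking you prefer, whether at web server or firewall level.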
Tuesday, January 5, 2010 3:45 PM
Here at Penguin Blogs, Inc. we get a fair bit of comment spam. Most of it is automatically blocked by a fairly ingenious filter mechanism, but from time to time unknowns get through, such as contentless posts like the following:
Very nice posting. I liked it.
thank you for your great posting.
Well that was a nice post
Purportedly this was written by a "Nick Matyas" of "Web Royalty", which looks to be a legitimate SEO consultancy (look them up yourself, I'm not giving them the benefit of a link) but which is either using underhanded spamming methods or has made a bad choice in outsourcing its own SEO.
Whatever, their URL is now on the filter list for this and many other sites, so they won't be troubling us here again.
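A URL filter list of this kind can be sketched roughly as follows. This is not the filter actually running here; the `BLOCKED_DOMAINS` set, the placeholder domain `spammer.example` and the function name are all assumptions for illustration.

```python
import re

# Hypothetical blocklist; "spammer.example" is a placeholder domain.
BLOCKED_DOMAINS = {"spammer.example"}

# Captures the host part of any http(s) URL.
URL_PATTERN = re.compile(r"https?://([^/\s:]+)", re.IGNORECASE)


def blocked_domain(author_url, body, blocked=BLOCKED_DOMAINS):
    """Return the first blocked domain found in the author URL or body, else None."""
    for host in URL_PATTERN.findall(f"{author_url} {body}"):
        host = host.lower()
        for domain in blocked:
            # Match the domain itself and any subdomain of it.
            if host == domain or host.endswith("." + domain):
                return domain
    return None


print(blocked_domain("http://www.spammer.example/", "Very nice posting. I liked it."))
```

Checking the author URL as well as the body matters for comments like the ones above, where the text itself contains nothing to match on.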
Wednesday, December 19, 2007 9:31 AM
Deja-vu all over again...
74.86.249.98 - - [19/Dec/2007:10:17:22 +0100] "GET /path/to/file HTTP/1.1" 200 10049 "-" "Mozilla/5.0 (compatible; Gigamega.bot/1.0; +http://www.gigamega.net/bot.html)"
Tuesday, November 27, 2007 12:30 PM
38.113.234.181 - - [27/Nov/2007:13:21:24 +0100] "GET /robots.txt HTTP/1.0" 200 612 "-" "voyager-hc/1.0"
38.113.234.181 - - [27/Nov/2007:13:21:35 +0100] "GET /path/to/some/file.html HTTP/1.0" 301 363 "-" "voyager-hc/1.0"
38.113.234.181 resolves to crawl1.cosmixcorp.com, and cosmixcorp.com redirects to kosmix.com - a California, USA-based outfit which appears to be legit in a "we're a cool California start-up" kind of way. Not quite sure what they're doing (hey - it's Web 2.0), but it evidently involves crawling without an identifiable bot UA.
Our secret sauce (all Web 2.0 companies need one) is our categorization engine that crawls billions of Web pages in a unique manner to create algo-generated home pages…more on this later.
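When a UA string says nothing useful, as with "voyager-hc/1.0", a reverse DNS lookup on the address is the obvious next step. A minimal sketch using Python's standard `socket` module follows; the function name `reverse_lookup` and the forward-confirmation step are my own additions, and the resolver arguments are injectable only so the example can be tried without network access.

```python
import socket


def reverse_lookup(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return (hostname, confirmed) for an IP address.

    confirmed is True only if the hostname also resolves back to the same IP,
    since anyone controlling a reverse zone can claim any hostname.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return None, False  # no PTR record or lookup failure
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return hostname, False  # hostname does not resolve forwards
    return hostname, ip in addresses


# Stubbed resolvers standing in for real DNS, using the values from the entry above.
def fake_reverse(ip):
    return ("crawl1.cosmixcorp.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["38.113.234.181"])


print(reverse_lookup("38.113.234.181", fake_reverse, fake_forward))
```

Called without the stub arguments, the function performs real lookups, and the result will depend on whatever DNS says at the time.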
Wednesday, November 7, 2007 2:26 PM
A string of entries from a bot calling itself LiteFinder/1.0: never heard of it, though the URL provided (http://www.litefinder.net/about.html) does work and claims it's "a research project started by a group of Indian candidates from the cities of Bangalore, Patna and Jaipur."
Thursday, August 3, 2006 5:57 AM
This stupid bot doesn't understand UTF-8 encoded URLs...
220.208.55.xxx - - [03/Aug/2006:07:45:03 +0200] "GET /ã??ã?«ã??.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
220.208.55.xxx - - [03/Aug/2006:07:45:04 +0200] "GET /ã??ã?£ã?ªã??ã?³.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
220.208.55.xxx - - [03/Aug/2006:07:45:06 +0200] "GET /è??å?¤�治�.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
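For comparison, this is roughly what a well-behaved client should send for a non-ASCII path: the UTF-8 bytes of each character, percent-encoded. A minimal sketch using Python's `urllib.parse`; the path `/日本語.html` is a made-up example, not one of the URLs requested above.

```python
from urllib.parse import quote, unquote


def encode_path(path):
    """Percent-encode a URL path as UTF-8, leaving '/' separators intact."""
    return quote(path, safe="/", encoding="utf-8")


def decode_path(encoded):
    """Reverse of encode_path: decode percent-escapes as UTF-8."""
    return unquote(encoded, encoding="utf-8")


path = "/日本語.html"  # hypothetical path for illustration
encoded = encode_path(path)
print(encoded)  # /%E6%97%A5%E6%9C%AC%E8%AA%9E.html
print(decode_path(encoded) == path)  # True
```

A client that sends the raw bytes instead, or encodes them in some other character set, ends up requesting a path the server has never heard of, hence the 404s.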
Tuesday, May 16, 2006 11:20 PM
This morning I found one of my sites had been subjected to a deep crawl by a bot naming itself "Francis/2.0 (francis@neomo.de http://www.neomo.de/)". The site seems to be an experimental but legitimate German-language search engine. The first hits from the bot were to robots.txt, although the site's crawler information page doesn't indicate what entries it interprets, if any. Requests look like this:
85.10.204.13 - - [16/May/2006:19:19:09 +0200] "GET /robots.txt HTTP/1.1" 206 390 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:09 +0200] "GET /robots.txt HTTP/1.1" 206 390 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:24 +0200] "GET / HTTP/1.1" 206 1949 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:25 +0200] "GET / HTTP/1.1" 206 1949 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
Interestingly, all requests returned HTTP status 206 (Partial Content), which suggests the bot sends a Range header with every request.
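A quick way to check whether a bot gets the same status for everything is to tally status codes per user agent. A minimal sketch, assuming the combined log format used above; the pattern and function name are my own.

```python
import re
from collections import Counter, defaultdict

# Matches the tail of a combined-format line: status, size, referer, user agent.
TAIL = re.compile(r'" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$')


def status_by_agent(lines):
    """Return {user_agent: Counter({status: count})} for the given log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = TAIL.search(line.strip())
        if match:
            counts[match["agent"]][match["status"]] += 1
    return counts


sample = [
    '85.10.204.13 - - [16/May/2006:19:19:09 +0200] "GET /robots.txt HTTP/1.1" '
    '206 390 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"',
    '85.10.204.13 - - [16/May/2006:19:19:24 +0200] "GET / HTTP/1.1" '
    '206 1949 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"',
]

for agent, statuses in status_by_agent(sample).items():
    print(agent, dict(statuses))
```

An agent whose counter contains nothing but 206 is one that asks for byte ranges every time.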