TECH.BARWICK.DE
Search Engines and Optimization
 



Wednesday, February 9, 2011   5:31 AM

magpie-crawler/1.1 (Brandwatch)

I was idly watching some Apache access logs scroll by (well actually I was busy doing something, but like to keep an eye on things to spot any interesting or worrying trends early) and noticed a bunch of entries like this:

94.228.34.238 - - [06/Feb/2011:06:21:23 +0100] "GET /blog/?o=10 HTTP/1.1" 301 290 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)"
94.228.34.238 - - [06/Feb/2011:06:21:33 +0100] "GET /blog/?o=40 HTTP/1.1" 301 290 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)"

Never noticed that UA before, and it never seemed to follow up on the (perfectly valid) 301 redirects. A scan through 3 months of logs shows it's been doing that all the time - what a dumb bot.
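A scan like that can be scripted. The following is a minimal sketch, not the method actually used here: it assumes the logs are in Apache's "combined" format (as in the entries above) and tallies status codes for any user agent containing a given substring. The names LOG_LINE and status_counts are made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def status_counts(lines, agent_substring):
    """Count HTTP status codes for requests whose UA contains agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and agent_substring in match.group("agent"):
            counts[match.group("status")] += 1
    return counts


# The two entries quoted above, used as sample input.
sample = [
    '94.228.34.238 - - [06/Feb/2011:06:21:23 +0100] "GET /blog/?o=10 HTTP/1.1" '
    '301 290 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)"',
    '94.228.34.238 - - [06/Feb/2011:06:21:33 +0100] "GET /blog/?o=40 HTTP/1.1" '
    '301 290 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)"',
]

print(status_counts(sample, "magpie-crawler"))
```

For real use you would pass an open log file instead of the sample list. A bot that only ever racks up 301s and never requests the redirect targets shows up as a lone 301 bucket.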

Checking the URL provided, this appears to be the home page of a UK-based company providing "Social Media Monitoring Tools" - and who don't have the courtesy to provide any more information about their bot / crawler. Which is evidently not popular in some quarters.

So, as "Brandwatch" provides neither myself nor the sites I run with any conceivable benefit, it's on the blocklist they go.

(I wonder if they monitor their own brand?)



Tuesday, August 31, 2010  12:50 AM

Purebot from www.puritysearch.net

I noticed this fella hammering away at one of my sites:

195.42.102.25 - - [31/Aug/2010:02:16:10 +0200] "GET /some/url.html HTTP/1.1" 200 13427 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"
195.42.102.25 - - [31/Aug/2010:02:16:12 +0200] "GET /some/other/url.html HTTP/1.1" 200 55822 "-" "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"

The URL in the UA string appears at first glance to be some kind of search site, but the "category links" on the page are stuffed with the kind of keywords you associate with spam of all kinds, and no actual search results were returned for a couple of common keywords.

It now has its own entry in the Bad Bot List and some special rules at firewall level. Hasta la vista.
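For illustration only, here is roughly what a user-agent check against such a list might look like. This is a sketch under assumptions: the actual Bad Bot List and firewall rules are not shown in this post, and BAD_BOT_SUBSTRINGS and is_bad_bot are hypothetical names, seeded with UA fragments from the log entries on this page.

```python
# Hypothetical list entries, taken from UA strings quoted in these posts.
BAD_BOT_SUBSTRINGS = ("magpie-crawler", "Purebot")


def is_bad_bot(user_agent, blocklist=BAD_BOT_SUBSTRINGS):
    """Return True if the UA string contains any blocklisted fragment (case-insensitive)."""
    ua = user_agent.lower()
    return any(entry.lower() in ua for entry in blocklist)


print(is_bad_bot("Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)"))
```

Substring matching on the UA only catches bots that identify themselves honestly, which is presumably why an IP-based rule at firewall level is a useful second line of defence.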



Tuesday, January 5, 2010   3:45 PM

Web Royalty - ignoble spam

Here at Penguin Blogs, Inc. we get a fair bit of comment spam. Most of it is automatically blocked by a fairly ingenious filter mechanism, but from time to time something new gets through, such as contentless comments like the following:

Very nice posting. I liked it.
thank you for your great posting.
Well that was a nice post

Purportedly this was written by a "Nick Matyas" of "Web Royalty" - what looks to be a legitimate SEO consultancy (look them up yourself, I'm not giving them the benefit of a link) but who are either using underhanded spamming methods, or have made a bad choice in outsourcing their own SEO.

Whatever, their URL is now on the filter list for this and many other sites, so they won't be troubling us here again.



Wednesday, December 19, 2007   9:31 AM

Research my 403s, GigaMega

Deja-vu all over again...

74.86.249.98 - - [19/Dec/2007:10:17:22 +0100] "GET /path/to/file HTTP/1.1" 200 10049 "-" "Mozilla/5.0 (compatible; Gigamega.bot/1.0; +http://www.gigamega.net/bot.html)"



Tuesday, November 27, 2007  12:30 PM

voyager-hc/1.0

38.113.234.181 - - [27/Nov/2007:13:21:24 +0100] "GET /robots.txt HTTP/1.0" 200 612 "-" "voyager-hc/1.0"
38.113.234.181 - - [27/Nov/2007:13:21:35 +0100] "GET /path/to/some/file.html HTTP/1.0" 301 363 "-" "voyager-hc/1.0"

38.113.234.181 resolves to crawl1.cosmixcorp.com, and cosmixcorp.com redirects to kosmix.com - a California, USA-based outfit which appears to be legit in a "we're a cool California start-up" kind of way. Not quite sure what they're doing (hey - it's Web 2.0), but it evidently involves crawling without an identifiable bot UA.
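When a bot's UA tells you nothing, a reverse lookup like the one above is the quickest way to find out who is behind it. Below is a small sketch using Python's standard socket module. The forward-confirmation step (checking that the returned hostname resolves back to the same IP) is an addition of mine, not something done in the lookup above, and forward_confirmed_rdns is a made-up name. The resolver functions are passed in as parameters so the logic can be tried without touching DNS.

```python
import socket


def forward_confirmed_rdns(ip,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the reverse-DNS hostname for ip if it resolves back to ip, else None."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return None
    return hostname if ip in addresses else None


# Live usage (needs network access; the result depends on current DNS):
#   forward_confirmed_rdns("38.113.234.181")
```

Anyone who controls the reverse zone for an IP block can put any name they like in the PTR record, so the forward check is what makes the hostname worth trusting.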

Our secret sauce (all Web 2.0 companies need one) is our categorization engine that crawls billions of Web pages in a unique manner to create algo-generated home pages…more on this later.

In the meantime kosmix.com has vanished from the internet - and good riddance.



Wednesday, November 7, 2007   2:26 PM

Litefinder.net - another bot bites the dust

A string of entries from a bot calling itself LiteFinder/1.0: never heard of it, though the URL provided (http://www.litefinder.net/about.html) does work and claims it's "a research project started by a group of Indian candidates from the cities of Bangalore, Patna and Jaipur."



Thursday, August 3, 2006   5:57 AM

Pockey-GetHTML - a very stupid bot

This stupid bot doesn't understand UTF-8 encoded URLs...

220.208.55.xxx - - [03/Aug/2006:07:45:03 +0200] "GET /ã??ã?«ã??.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
220.208.55.xxx - - [03/Aug/2006:07:45:04 +0200] "GET /ã??ã?£ã?ªã??ã?³.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
220.208.55.xxx - - [03/Aug/2006:07:45:06 +0200] "GET /è??å?¤�治�.html HTTP/1.1" 404 2422 "-" "Pockey-GetHTML/4.14.1 (Win32; GUI; ix86)"
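The "ã" runs in those requests are what multi-byte UTF-8 sequences look like when something treats each byte as a separate Latin-1 character. Here is a short sketch of the effect next to the percent-encoded form a well-behaved client would send. The katakana character is purely an illustrative assumption (the real file names aren't recoverable from the log), and the function names are made up.

```python
from urllib.parse import quote


def as_latin1_mojibake(text):
    """Show how UTF-8 bytes appear when wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def percent_encoded(path):
    """Percent-encode a path as UTF-8, leaving the slashes alone."""
    return quote(path, safe="/")


# Assumed example character: katakana KA (U+30AB), UTF-8 bytes E3 82 AB.
print(repr(as_latin1_mojibake("カ")))
print(percent_encoded("/カ.html"))
```

The first byte (E3) comes out as "ã", the middle byte is an unprintable control character (hence the "?" in the log), and the last byte (AB) comes out as "«", which matches the "ã?«" pattern in the first request above.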


Tuesday, May 16, 2006  11:20 PM

Neomo

This morning I found one of my sites had been subjected to a deep crawl by a bot naming itself "Francis/2.0 (francis@neomo.de http://www.neomo.de/)". The site seems to be an experimental but legitimate German-language search engine. The first hits from the bot were to robots.txt, although the site's crawler information page doesn't indicate which entries it interprets, if any. Requests look like this:

85.10.204.13 - - [16/May/2006:19:19:09 +0200] "GET /robots.txt HTTP/1.1" 206 390 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:09 +0200] "GET /robots.txt HTTP/1.1" 206 390 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:24 +0200] "GET / HTTP/1.1" 206 1949 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"
85.10.204.13 - - [16/May/2006:19:19:25 +0200] "GET / HTTP/1.1" 206 1949 "-" "Francis/2.0 (francis@neomo.de http://www.neomo.de/)"

Interestingly, all requests returned HTTP status 206 (Partial Content), which suggests the bot was sending a Range header with every request.