TechFold - Bold tech & web commentary
TechFold is technology discussion, commentary, reviews, and opinions from well outside the valley. There's no koolaid to drink here, and TechFold is not in SL, or on Twitter.
Twitter Spamming
I saw some referral traffic from Twitter this AM, and went over to see who had Twittered me. At which point, I realized (a) there doesn’t appear to be a way to search Twitter, and (b) Twitter is vulnerable to spamming. Here’s the frontpage that I saw:

Note the “usaid3” posts about Sony buying Microsoft. Breaking news. Can’t see what’s behind the TinyURL. Tempted to click? Well, usaid3 really wants you to click - it’s a script that’s bombing the Twitter front page, posting the same thing every few seconds:

In the time that it took to write this post, the message being sent by usaid3 switched to “Twitter TV” - pointing to the same TinyURL.
Oddly, that TinyURL goes to a defunct MySpace page. Confused? Me too. Looks like a spambot that outlived the page it was trying to drive traffic to.
Of course, I don’t imagine too many regular Twitter users are hanging out on the front page. That being said, it still seems odd that Twitter doesn’t appear to have any anti-spam measures in place - I can’t imagine a much clearer spam flag than one account posting the identical message hundreds of times within a few minutes.
Perhaps Twitter should write a hook into Akismet, or validate posting IPs against http:BL. Doesn’t seem like rocket science to me.
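To make that spam flag concrete, here is a minimal Python sketch of a repeat-post check. I have no idea how Twitter is built, so the class name, the thresholds, and the idea of keying on account plus message text are all my own assumptions, not anything Twitter actually does.

```python
from collections import defaultdict, deque
import hashlib


class RepeatPostDetector:
    """Flags an account that posts the same text too often in a short window."""

    def __init__(self, max_repeats=5, window_seconds=300):
        # Both thresholds are illustrative guesses, not values from Twitter.
        self.max_repeats = max_repeats
        self.window_seconds = window_seconds
        # (account, text digest) -> timestamps of recent identical posts
        self._seen = defaultdict(deque)

    def is_spam(self, account, text, timestamp):
        """Record a post and return True once the repeat limit is exceeded."""
        # Ignore case and whitespace differences between otherwise equal posts.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        times = self._seen[(account, digest)]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_repeats
```

Since usaid3 changed its text but kept the same TinyURL, a real filter would probably want to key on the links in a post as well as the text. That is left out here to keep the sketch short.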
myspace, spam, twitter
If you enjoyed this post, make sure you subscribe to my RSS feed!
http:BL - Anti-Spam Tech for Servers from the HoneyPot Project
Let me preface this by being very clear that this sort of thing is more or less beyond my level of technical acumen. That being said, it sounds like the next step in the battle against spammers and harvesters, and is being well received on Digg: http:BL surfaces the database of Project Honey Pot - a two-year-old project to identify spammers & email address harvesters. More importantly, it does so through server modules and a web API, and returns rich data about IPs as opposed to a yea/nay blacklist response.
Project Honey Pot is the first and only distributed system for identifying spammers and the spambots they use to scrape addresses from your website. Using the Project Honey Pot system you can install addresses that are custom-tagged to the time and IP address of a visitor to your site. If one of these addresses begins receiving email we not only can tell that the messages are spam, but also the exact moment when the address was harvested and the IP address that gathered it. [from: About Us]
http:BL makes this database available as an Apache module that can be used to block spam-originators from accessing your site, as well as a web API. Using either, you can get a formatted response from http:BL describing each IP that you query (i.e.: each IP visiting your site) and then act on it. Going by the values the spec quotes below, its example response is 127.3.5.1, which breaks down as follows:
The first octet (127 in the example above) is always 127 and is pre-defined to not have a specified meaning related to the particular visitor.
The second octet represents the number of days since last activity. In the example above, it has been 3 days since the last time the queried IP address saw activity on the Project Honey Pot network. This value ranges from 0 days to 255 days. This value is useful in helping you assess how “stale” the information provided by http:BL is and therefore the extent to which you should rely on it.
The third octet (5 in the example above) represents a threat score for IP. This score is assigned internally by Project Honey Pot based on a number of factors including the number of honey pots the IP has been seen visiting, the damage done during those visits (email addresses harvested or forms posted to), etc. The range of the score is from 0 to 255, where 255 is extremely threatening and 0 indicates no threat score has been assigned. In the example above, the IP queried has a threat score of “5”, which is relatively low. While a rough and imperfect measure, this value may be useful in helping you assess the threat posed by a visitor to your site.
The fourth octet (1 in the example above) represents the type of visitor. Defined types include: “search engine,” “suspicious,” “harvester,” and “comment spammer.” Because a visitor may belong to multiple types (e.g., a harvester that is also a comment spammer) this octet is represented as a bitset with an aggregate value from 0 to 255. In the example above, the type is listed as 1, which means the visitor is merely “suspicious.” A chart outlining the different types appears below. This value is useful because it allows you to treat different types of robots differently.
[From: API Spec]
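To see how the octets fit together, here is a rough Python sketch that takes a response string like the example values quoted above (127, 3, 5, 1) and splits it into its fields. The function and class names are mine, and the bit values for each visitor type are what I understand the API spec's chart to say, so treat them as assumptions and check the spec before relying on them.

```python
from dataclasses import dataclass

# Type bits as I understand them from the http:BL API spec (assumption:
# verify against the spec). A type value of 0 means "search engine".
SUSPICIOUS = 1
HARVESTER = 2
COMMENT_SPAMMER = 4


@dataclass
class HttpBLResult:
    days_since_last_activity: int
    threat_score: int
    visitor_type: int

    def labels(self):
        """Return the human-readable types encoded in the bitset."""
        if self.visitor_type == 0:
            return ["search engine"]
        names = []
        if self.visitor_type & SUSPICIOUS:
            names.append("suspicious")
        if self.visitor_type & HARVESTER:
            names.append("harvester")
        if self.visitor_type & COMMENT_SPAMMER:
            names.append("comment spammer")
        return names


def parse_httpbl_response(response):
    """Split a dotted response such as '127.3.5.1' into its three fields."""
    parts = response.strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected four octets, got %r" % response)
    octets = [int(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError("octet out of range in %r" % response)
    if octets[0] != 127:
        raise ValueError("first octet should be 127, got %d" % octets[0])
    return HttpBLResult(octets[1], octets[2], octets[3])
```

Calling parse_httpbl_response("127.3.5.1") gives 3 days since last activity, a threat score of 5, and the single label "suspicious", which matches the walkthrough in the quoted spec.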
This is powerful - a big step beyond standard blacklisting, and the API access means it’s likely to turn up in all sorts of places: I can see someone quickly adding it to WordPress, for example, as a plugin complement to Akismet.
There are already people out there writing code to protect their blog comment forms, and no doubt more to follow.
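For a sense of what protecting a comment form might look like, here is a small Python sketch that turns the three http:BL values into a decision. The thresholds, the function name, and the three-way allow/moderate/block split are all hypothetical choices of mine, and the bit value for “comment spammer” is my reading of the spec rather than something quoted above.

```python
# Hypothetical thresholds for a comment form; tune them for your own site.
MAX_AGE_DAYS = 30
MIN_THREAT = 25
COMMENT_SPAMMER = 4  # assumed bit value for "comment spammer" in the type octet


def comment_action(days_since_last_activity, threat_score, visitor_type):
    """Return 'allow', 'moderate', or 'block' for a comment submission."""
    if visitor_type == 0:
        return "allow"  # type 0 is a search engine, not a hostile visitor
    if days_since_last_activity > MAX_AGE_DAYS:
        return "allow"  # stale data, so don't punish whoever has the IP now
    if (visitor_type & COMMENT_SPAMMER) and threat_score >= MIN_THREAT:
        return "block"
    return "moderate"  # recent but low-grade activity: hold for review
```

The point of the graded response is that http:BL gives you more than a yes/no answer, so a merely “suspicious” visitor can be held for moderation instead of being turned away outright.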
apache, blacklist, honeypot, projecthoneypot, spam
How WordPress.com Stays Spam Free
Plagiarism Today has a very interesting article on how WordPress.com (Automattic’s free blog hosting service) stays relatively spam and splog free compared to - for example - Google’s BlogSpot, which PT figures is between 50 and 77% splogs. Part interview with Matt Mullenweg and part conjecture, the article posits that WordPress.com’s anti-spam success can be attributed to:
- Akismet: Spam trackbacks & comments that point back to WordPress.com-hosted blogs conveniently identify splogs - which can then be assessed and removed.
- No Advertising: Banning AdSense removes most of the motivation to run a spam blog.
- Human Support: Mullenweg tells PT that using the “Report as Spam” feature will get a given site looked at and (if justified) removed within an hour.
So - a combination of technological edge, intelligent business decisions, and a corporate commitment to rooting out spam keeps WordPress.com clean.
You can add “social networking” to the anti-spam features list as well, as pointed out in the context of LiveJournal, in another PT article: by assuming that the only LiveJournal Friends of a spam blog are also spam blogs (a valid assumption), LJ can take down entire networks of spam blogs rapidly. Effectively, attempting to leverage LJ’s social networking features makes splogs vulnerable.
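The LiveJournal trick is easy to picture as a walk over the friends graph. Here is a minimal Python sketch, assuming friend lists are available as a simple mapping. The function name and data shape are hypothetical, and I am guessing at which direction of friend link LJ actually follows.

```python
from collections import deque


def suspect_network(friends, known_splog):
    """Collect every blog reachable from a known splog through friend links.

    `friends` maps a blog name to the list of blogs it has friended.
    This leans on the assumption in the text that only splogs friend splogs,
    so the result is a review queue, not proof.
    """
    suspects = {known_splog}
    queue = deque([known_splog])
    while queue:
        blog = queue.popleft()
        for friend in friends.get(blog, []):
            if friend not in suspects:
                suspects.add(friend)
                queue.append(friend)
    return suspects
```

Starting from one confirmed splog, the walk pulls in its friends, then their friends, and so on, which is how a single report could expose a whole network.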
So - the question of the day is: Why can’t Google’s stable of doctorate-holding engineers figure this out? Google has the skills, the infrastructure, and the data to identify and remove spam, as well as the corporate mandate (“don’t be evil”?) to do so. However, Google also has the ad network whose inventory is kept inflated by BlogSpot’s millions of splogs. So - what do you think? Does BlogSpot need to look for a technological solution, a more active anti-spam community, or a new relationship with corporate-parent Google and AdSense?
EDIT: Using splogs to build PageRank is an issue too. Perhaps it’s time to consider something like adding the nofollow attribute to links in blogs that have fewer than a certain number of non-spam posts? i.e.: once your blog has been around long enough to be reliably identified as not-a-splog, it would graduate to “trusted” status - and links you post would no longer carry the nofollow attribute and would contribute to PageRank… just a thought.
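As a rough illustration of the “graduation” idea, here is a Python sketch of a link renderer that adds rel="nofollow" until a blog passes a post-count threshold. The threshold of 20 and the function name are made up for the example, and counting non-spam posts is assumed to happen somewhere else.

```python
from html import escape

TRUST_THRESHOLD = 20  # hypothetical number of non-spam posts needed for trust


def render_link(url, text, non_spam_posts):
    """Render an anchor tag, adding rel="nofollow" until the blog is trusted."""
    rel = "" if non_spam_posts >= TRUST_THRESHOLD else ' rel="nofollow"'
    return '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel, escape(text))
```

A blog with 3 clean posts would get links carrying rel="nofollow", while the same blog at 20 posts would get plain links that count toward PageRank.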
abuse, akismet, blogspot, google, livejournal, spam, splogs, wordpress





