Archive for November, 2005

Spam Filtering by Link Analysis

Posted by alex on November 15th, 2005

I’ve just thought of an interesting method of tackling spam. One thing that almost all unwanted emails have in common (almost all; I’ll get to the rest in a bit) is that they point you towards a website of some sort, so that you can purchase their products or so that they can steal your credit card details. Spammers are getting around current filtering methods by hiding the intention of their email within random bits of text (often from well-known books or other documents), which fools filters that look out for common words and phrases. Others don’t put much text in their email at all, but attach an image of their advert which, when clicked, links to their website.

One tactic filters could use (and probably do) is to treat messages containing links as more likely to be spam. But this would wrongly catch the short, legitimate messages loads of people send saying something like “Hey check out this funny link…” etc.

A more effective method would be to analyse the link itself. An email service such as Google Mail is able to keep lots of information about spam to improve its filtering. For example, it would be quite easy to keep a list of the links that appear in messages people have marked as spam. If lots of people mark messages containing the same link as spam, that link becomes an easy identifier for any future spam from the same company.
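Here is a rough Python sketch of that list of reported links. It is only an illustration: the names `ReportedLinks`, `extract_links` and `link_key`, the simple URL pattern, and the threshold of three reports are all my own assumptions, not how any real mail service does it.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Deliberately simple pattern; real messages need more careful parsing.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(body):
    """Return the http(s) links in a message body, minus trailing punctuation."""
    return [link.rstrip(".,;:!?)") for link in URL_PATTERN.findall(body)]


def link_key(link):
    """Normalise a link to (lower-cased host, path) so trivial variants match."""
    parts = urlsplit(link)
    return (parts.hostname or "", parts.path or "/")


class ReportedLinks:
    """Counts how many reported-as-spam messages contained each link."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed cut-off, tune to taste
        self.counts = Counter()

    def report_spam(self, body):
        # Count each link once per reported message.
        for key in {link_key(link) for link in extract_links(body)}:
            self.counts[key] += 1

    def is_suspect(self, body):
        # A message is suspect if any of its links has enough reports.
        return any(
            self.counts[link_key(link)] >= self.threshold
            for link in extract_links(body)
        )
```

Each time a user marks a message as spam the service would call `report_spam` with its body, and new mail would be checked with `is_suspect`.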

However, this method is likely to be evaded by using anonymous-looking links. It would be fairly easy to use a service such as tinyurl.com, which redirects you to a website given an identifier. The spammer could then create a new redirecting URL for every message sent, meaning no two URLs would look the same.

A better method would be for the spam filter to follow the links given in messages. If a link redirects to a blacklisted URL then the message is most likely spam. All the filter would need to do is request the headers of the URL being pointed to and read the redirect target from them. To get round this, the spammer would have to keep changing the address of their website, which could be quite costly for them and would limit the number of potential visits to their site.
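Following a redirect without downloading the page could look something like the sketch below. It sends an HTTP HEAD request and reads the Location header from the response. The function names, the limit of five hops, and the idea of keeping the blacklist as a set of host names are assumptions of mine for illustration.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head_location(url, timeout=5):
    """Send a HEAD request; return the redirect target, or None if there is none."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            return urljoin(url, location)
        return None
    finally:
        conn.close()


def final_url(url, fetch_location=head_location, max_hops=5):
    """Follow redirects (up to max_hops) and return the last URL reached."""
    seen = {url}
    for _ in range(max_hops):
        target = fetch_location(url)
        if target is None or target in seen:  # no redirect, or a loop
            break
        seen.add(target)
        url = target
    return url


def redirects_to_blacklisted(url, blacklist, fetch_location=head_location):
    """True if the link ends up on a host in the (assumed) set of bad hosts."""
    return urlsplit(final_url(url, fetch_location)).hostname in blacklist
```

The `fetch_location` parameter is there so the redirect logic can be tried out with a fake lookup table instead of real network requests.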

An even better and more reliable method would be for the filter to follow the link and download the actual webpage. Analysing the words on the webpage would be far more effective than analysing the words in the email. It would also be possible to identify the website by its content rather than by its address, which would make it a lot easier to blacklist.
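As a very rough sketch of what analysing the page might involve, the Python below pulls the visible words out of some HTML, scores them against a word list, and builds a fingerprint from the page's vocabulary so that the same page can be recognised at a different address. The word list, the scoring rule and the fingerprint scheme are all made up for illustration; a real filter would want something far more robust (an exact hash like this breaks as soon as the spammer changes a single word).

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_words(html):
    """Return the lower-cased words that appear as text on the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower())


# Hypothetical word list, purely for the example.
SPAM_WORDS = {"cheap", "pills", "viagra", "casino", "mortgage"}


def spam_score(html, spam_words=SPAM_WORDS):
    """Fraction of the page's words that are on the spam word list."""
    words = page_words(html)
    if not words:
        return 0.0
    return sum(word in spam_words for word in words) / len(words)


def content_fingerprint(html):
    """Hash of the page's sorted vocabulary, independent of its address."""
    vocabulary = " ".join(sorted(set(page_words(html))))
    return hashlib.sha1(vocabulary.encode("utf-8")).hexdigest()
```

A blacklist keyed on `content_fingerprint` rather than on the URL would keep matching a page after it moves to a new domain, as long as its wording stays the same.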

The problem with this, of course, is the resources required to download a webpage every time someone links to a website in an email. To limit this, the standard methods of spam filtering should be used as well. If a message looks like it might be spam then its link should be analysed. If the link is blacklisted then the message should be marked as spam. If the domain of the link is whitelisted then it should be marked as not spam. Otherwise the link should be followed. If the domain of the webpage it leads to is blacklisted then the message should be marked as spam. If that domain or address is whitelisted then it should be marked as not spam. Otherwise the content of the webpage should be downloaded and analysed, and the webpage added to the blacklist or whitelist based on its content.
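Written out as code, that chain of checks might look roughly like this. It is a sketch only: it assumes the function is called just for messages the standard filter already thinks are suspicious, that the blacklist and whitelist are plain sets holding domains or full addresses, and that `resolve` (follow the link) and `page_is_spam` (download and analyse the page) are supplied by the caller. All of those names are invented for the example.

```python
from urllib.parse import urlsplit

SPAM = "spam"
NOT_SPAM = "not spam"


def domain_of(url):
    """Lower-cased host name of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def check_lists(url, blacklist, whitelist):
    """Return SPAM or NOT_SPAM if either list knows the URL, else None."""
    if url in blacklist or domain_of(url) in blacklist:
        return SPAM
    if url in whitelist or domain_of(url) in whitelist:
        return NOT_SPAM
    return None


def classify_link(link, blacklist, whitelist, resolve, page_is_spam):
    """Classify a link from a suspicious message, doing the cheap checks first."""
    # 1. Is the link itself already known?
    verdict = check_lists(link, blacklist, whitelist)
    if verdict:
        return verdict

    # 2. Follow the link and check where it ends up.
    target = resolve(link)
    verdict = check_lists(target, blacklist, whitelist)
    if verdict:
        return verdict

    # 3. Last resort: download and analyse the page, then remember the result.
    if page_is_spam(target):
        blacklist.add(domain_of(target))
        return SPAM
    whitelist.add(domain_of(target))
    return NOT_SPAM
```

The expensive download in step 3 only happens the first time a domain is seen; after that one of the lists answers straight away.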

Of course the process would probably be a little more complex than that but that’s the basic idea. The whitelist and blacklist should be effective in reducing the amount of traffic and unnecessary downloading.

I mentioned earlier that some spam doesn’t include links. The only messages of this type I’ve seen are those related to the stock market, which are intended to cause a stir among investors and drive the price of shares up. This seems to be a relatively new type of spam, but of course once spam filters adjust to block messages with words such as st0ck or invest0r we should see fewer and fewer of them.

Animated Gif

Posted by alex on November 13th, 2005

Got an hour to waste? Have a look at this huge animated gif.

Galaxy Simulations

Posted by alex on November 12th, 2005

I was recently pointed to some awesome simulations involving galaxies and what might happen when they collide. The simulations were created on a supercomputer and let us view what might happen over a period of several billion years in just a few minutes. A particularly interesting one shows the merger of the Milky Way with the Andromeda galaxy, which is predicted to start in the next 3-4 billion years or so. Some of the other simulations involve hundreds of galaxies interacting with each other. The movies are pretty big, so you’ll need a decent internet connection to download them.