Spam

Derek Bridge

Aims:

Image of Hormel spam from Wikipedia (Spam (food))

US: The Controlling the Assault of Non-Solicited Pornography and Marketing Act (CAN-SPAM Act) of 2003
- Allows unsolicited commercial email if it contains no fake headers and offers an opt-out
Europe: Directive on Privacy and Electronic Communications (2002/58/EC), implemented by member states from 2003
- Allows unsolicited commercial email if it contains no fake headers and the recipient has opted in
Discuss: Opt-in or opt-out?
Discuss: Enforceability
Note: some spam-related activity is definitely illegal (fraud, theft, damage)

Acquiring email addresses, e.g.:
- Web spiders
- Viruses
- Buy lists for a few dollars
Validating email addresses, e.g.:
- May not bother
- Invitations to opt out
- Embedded content in HTML email (images, stylesheets, ...)

Sending the spam, e.g.:
- Use web-bots to create webmail accounts
- Use insecure devices (relays, proxies, routers)
- Create zombies by infecting machines with viruses
Disguising the spam (spamouflage), e.g.:
- Fake email headers
- Obfuscated content: Viagra Via'gra V I A G R A Vaigra \/iagra Vi@gra
- All-image content
- Spam salad

Install spam filters at various points (ISP/company mail server, user's mail agent)
Why is spam filtering so challenging?
- Spam is diverse
- Spam is subjective
- Spam is a moving target
- Filtering errors have different 'costs':
  - Spam classified as ham: irritating
  - Ham classified as spam: significant

Image of a captcha dialog from The Official Captcha site

A DNSBL lists IP addresses of machines that send or relay spam (e.g. known spammers, insecure devices, zombies)
Email servers contact the list operator's server to check an IP address
They can reject or flag email which has been sent from any address on the list
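
A minimal sketch of the check described above. By DNSBL convention, the mail server reverses the octets of the sender's IPv4 address, appends the list's DNS zone, and does an ordinary DNS lookup; an answer means "listed". The zone `dnsbl.example.org` and the function names are hypothetical, chosen only for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError if ip is not valid IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list answers with an address record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name (not listed) or the lookup failed.
        # This sketch treats both cases as "not listed".
        return False


# 192.0.2.99 is a documentation-only address; the zone is hypothetical.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A mail server would call something like `is_listed(sender_ip, zone)` when a connection arrives, then reject or flag the message depending on the result.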
How does the list operator populate the list? E.g.:
- Customers can nominate addresses they think should be on the list
- List operators may use a program to scan the Internet looking for insecure devices
- List operators may set up spamtraps
See e.g. Wikipedia's Comparison of DNSBLs
Problems:
- Innocent parties end up on lists and find it hard to get off again
- Spammers attack the list operator's server
The user's mail agent usually also allows blacklists/whitelists of email addresses

Uses an algorithm that can compute a signature/fingerprint for an email based on the email's structure/content
When a new email arrives, compute its signature
Look up this signature in a database to see whether it matches the signatures of past spam
Why's it called "collaborative"?
- Relies on the community of users: when spam gets through, they should add its signature to the database
Two similar emails must generate the same signature
Best-known example: Vipul's Razor
Problems:
- Users often don't bother to contribute
- Spammers add spam salad so that signatures don't match
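
The signature idea can be sketched as follows. This is not the algorithm used by Vipul's Razor; it is an illustrative assumption in which the body is normalised (lower-cased, digit runs replaced, punctuation and whitespace collapsed) and then hashed, so that trivially different copies of one spam share a signature. The names `signature`, `report_spam`, `is_known_spam` and the in-memory `known_spam` set are hypothetical stand-ins for the shared database.

```python
import hashlib
import re


def signature(body: str) -> str:
    """Fingerprint an email body after normalising trivial variations."""
    text = body.lower()
    text = re.sub(r"\d+", "#", text)               # any run of digits becomes '#'
    text = re.sub(r"[^a-z#]+", " ", text).strip()  # collapse punctuation and whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


known_spam = set()  # stand-in for the shared signature database


def report_spam(body: str) -> None:
    """A user reports spam that got through: store its signature."""
    known_spam.add(signature(body))


def is_known_spam(body: str) -> bool:
    """Check whether a new email matches a previously reported signature."""
    return signature(body) in known_spam


report_spam("Buy CHEAP pills now!!! Order 12345")

# Same spam with different case, punctuation and order number: matches.
print(is_known_spam("buy cheap pills now - order 99871"))
# Same spam with random words appended (spam salad): no longer matches.
print(is_known_spam("buy cheap pills now - order 99871 walrus teapot quantum"))
```

The second check shows why spam salad defeats an exact-match signature: a few appended random words change the hash completely.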

Rules about headers, structure, content, format, e.g.:
- if From header starts with many numbers then email is spam (2.302)
- if body attempts to disguise the word 'viagra' then email is spam (2.203)
- if body includes HTML then email is spam (0.001)
Scores of the rules that match are combined for an overall judgment
Problem:
- Spam keeps changing so we have to keep adding rules
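
A minimal sketch of rule scoring, using the three example rules and scores above. Everything else is an assumption made for illustration: the message is a plain dict with `from` and `body` keys, "many numbers" is taken to mean four or more leading digits, the disguise detector is one simple regular expression (it catches inserted separators and character substitutions but not transposed letters such as "Vaigra"), and the threshold of 4.0 is hypothetical.

```python
import re

# Letters of 'viagra' with optional separators and a few common substitutions.
_VIAGRA_LIKE = re.compile(
    r"(?:v|\\/)[\W_]*[i1!|][\W_]*[a@][\W_]*g[\W_]*r[\W_]*[a@]",
    re.IGNORECASE,
)


def disguised_viagra(body: str) -> bool:
    """True if the body contains a viagra-like string that is not the plain word."""
    return any(m.group(0).lower() != "viagra" for m in _VIAGRA_LIKE.finditer(body))


# (description, test, score): scores are the ones quoted in the rules above.
RULES = [
    ("From header starts with many numbers",
     lambda m: re.match(r"\d{4,}", m["from"]) is not None, 2.302),
    ("Body disguises the word 'viagra'",
     lambda m: disguised_viagra(m["body"]), 2.203),
    ("Body includes HTML",
     lambda m: re.search(r"<html\b", m["body"], re.IGNORECASE) is not None, 0.001),
]

THRESHOLD = 4.0  # hypothetical cut-off for the combined score


def score(message: dict) -> float:
    """Sum the scores of every rule that matches the message."""
    return sum(points for _, test, points in RULES if test(message))


def is_spam(message: dict) -> bool:
    return score(message) >= THRESHOLD


msg = {"from": "12345678@example.com", "body": "Cheap V I A G R A here"}
print(f"{score(msg):.3f}", is_spam(msg))
```

Coping with a new disguise means adding or editing an entry in `RULES`, which is the maintenance burden noted above.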

A technique from Artificial Intelligence!
From a set of spam and a set of ham, compute conditional probabilities, e.g.
- Prob(spam | viagra) = 0.95, Prob(ham | viagra) = 0.05
- Prob(spam | satisfy) = 0.6, Prob(ham | satisfy) = 0.4
When a new email arrives, compute the probability that it is spam and the probability that it is ham by combining the conditional probabilities using Bayes' law
Can also update the conditional probabilities from the new emails: copes with change
Problem:
- Spam salad may throw things off track a little
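
A minimal sketch of the combining step, using the two example probabilities above. It uses one common way of combining per-word probabilities, which assumes that words occur independently and that spam and ham are equally likely beforehand; the function names, the word table, and the default of 0.5 for unseen words are assumptions made for illustration.

```python
from math import prod  # Python 3.8+

# Prob(spam | word), as would be estimated from a set of spam and a set of ham.
word_spam_prob = {"viagra": 0.95, "satisfy": 0.6}


def combine(spam_probs):
    """Combine per-word Prob(spam | word) values into one spam probability.

    Assumes independent words and equal prior probability of spam and ham.
    Probabilities of exactly 0 or 1 should be clamped before calling this,
    otherwise the denominator can be zero.
    """
    p_spam = prod(spam_probs)
    p_ham = prod(1.0 - p for p in spam_probs)
    return p_spam / (p_spam + p_ham)


def spam_probability(text: str, default: float = 0.5) -> float:
    """Look up each word (unseen words count as neutral) and combine."""
    probs = [word_spam_prob.get(w, default) for w in text.lower().split()]
    return combine(probs)


# 0.95 * 0.6 / (0.95 * 0.6 + 0.05 * 0.4) = 0.57 / 0.59
print(round(spam_probability("viagra satisfy"), 3))
```

Words that are more common in ham have a Prob(spam | word) below 0.5, so a spam salad containing enough of them pulls the combined value down.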