Spam
Derek Bridge
Spam
Aims:
- to learn why spam filtering is so hard
- to know the main techniques for spam filtering
- to learn how to avoid spam
Spam
- What is spam? http://www.spamhaus.org/
- Why is it sent?
- Statistically speaking, who sends it and from where?
- How much spam is sent?
- So what?
Image of Hormel spam from Wikipedia (Spam (food))
Is spam legal?
- US: The Controlling the Assault of Non-Solicited Pornography and Marketing Act
(CAN-SPAM Act) of 2003
- Allows unsolicited commercial email if it contains no fake headers and
includes an opt-out mechanism
- Europe: Directive on Privacy and Electronic Communications, 2003
- Allows unsolicited commercial email if it contains no fake headers and the
recipient has opted in
- Discuss: Opt-in or opt-out?
- Discuss: Enforceability
- Note: some spam-related activity is definitely illegal (fraud, theft, damage)
How do spammers spam?
- Acquiring email addresses, e.g.:
- Web spiders
- Viruses
- Buy lists for a few dollars
- Validating email addresses, e.g.:
- May not bother
- Invitations to opt out
- Embedded content in HTML email (images, stylesheets, ...)
How do spammers spam?
- Sending the spam, e.g.:
- Use web-bots to create webmail accounts
- Use insecure devices (relays, proxies, routers)
- Create zombies by infecting machines with viruses
- Disguising the spam (spamouflage), e.g.:
- Fake email headers
- Obfuscated content: Viagra Via'gra V I A G R A Vaigra \/iagra Vi@gra
- All-image content
- Spam salad
Spam filters
- Install spam filters at various points (ISP/company mail server, user's mail
agent)
- Why is spam filtering so challenging?
- Spam is diverse
- Spam is subjective
- Spam is a moving target
- Filtering errors have different 'costs':
- Spam classified as ham: irritating
- Ham classified as spam: significant
Challenge-response spam filters
- In a challenge-response spam filter,
- Your email server receives an email from an unknown sender
- It saves it temporarily
- It sends back an email to the sender
- The sender must reply to this email
- Sometimes sender is required to solve a captcha
- Ensures the sender is not a program
- If sender does reply, the original, saved email is delivered
- Class exercise: Give some of the (numerous) problems with this!
- Variants: SPF, graylists
Image of a captcha dialog from The Official Captcha site
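The challenge-response steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular mail server does it: the class name, the in-memory `pending` store, and the choice to whitelist a sender once they have replied are all assumptions made for the example.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge (hypothetical sketch)."""

    def __init__(self, known_senders=()):
        self.known_senders = set(known_senders)
        self.pending = {}  # challenge token -> (sender, saved message)
        self.inbox = []    # messages actually delivered to the user

    def receive(self, sender, message):
        """Deliver mail from known senders; otherwise save it and return a challenge token."""
        if sender in self.known_senders:
            self.inbox.append(message)
            return None
        token = secrets.token_hex(8)  # would be emailed back to the sender
        self.pending[token] = (sender, message)
        return token

    def reply(self, sender, token):
        """Deliver the saved message if the original sender returns the right token."""
        entry = self.pending.get(token)
        if entry is None or entry[0] != sender:
            return False
        del self.pending[token]
        self.known_senders.add(sender)  # assumption: a successful reply whitelists the sender
        self.inbox.append(entry[1])
        return True
```

Note that the saved email sits in `pending` indefinitely if the sender never replies, which hints at one of the problems the class exercise asks about.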
DNS BlackList (DNSBL) spam filters
- A DNSBL lists IP addresses of machines that send or relay spam
(e.g. known spammers, insecure devices, zombies)
- Email servers contact the list operator's server to check an IP address
- They can reject or flag email which has been sent from any address on the list
- How does the list operator populate the list? E.g.:
- Customers can nominate addresses they think should be on the list
- List operators may use a program to scan the Internet looking for insecure devices
- List operators may set up spamtraps
- See e.g. Wikipedia's Comparison of DNSBLs
- Problems:
- Innocent parties end up on lists and find it hard to get off again
- Spammers attack the list operator's server
- The user's mail agent usually also allows blacklists/whitelists of email addresses
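As a rough illustration of how an email server checks an IP address against a DNSBL: by convention the server reverses the octets of the IPv4 address, appends the list's DNS zone, and does an ordinary DNS lookup. An answer means "listed"; no such name means "not listed". The zone `dnsbl.example.org` below is a placeholder, not a real list, and the function names are invented for this sketch.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the DNSBL zone has an entry for this IP address.

    Simplification: any lookup failure (including a network outage)
    is treated as 'not listed'.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


# Placeholder zone; a real deployment would use its chosen list operator's zone.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

Because every check is a DNS query to the list operator's server, the sketch also shows why attacking that server disrupts filtering for all of its customers.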
Collaborative spam filters
- Uses an algorithm that can compute a signature/fingerprint for an email
based on the email's structure/content
- When a new email arrives, compute its signature
- Look up this signature in a database to see whether it matches the signatures of past spam
- Why's it called "collaborative"?
- Relies on the community of users: when spam gets through, they should add its
signature to the database
- Two similar emails must generate the same signature
- Best-known example: Vipul's Razor
- Problems
- users often don't bother to contribute
- spammers add spam salad so that signatures don't match
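A minimal sketch of the signature idea, assuming a deliberately naive signature: a hash of the email body after lowercasing and collapsing whitespace. This is not the algorithm Vipul's Razor uses; the function names and the in-memory set standing in for the shared database are made up for illustration.

```python
import hashlib
import re

# Stands in for the shared, community-maintained database of spam signatures.
spam_signatures = set()


def signature(body):
    """Hash the body after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def report_spam(body):
    """Called when a user reports that spam got through."""
    spam_signatures.add(signature(body))


def is_known_spam(body):
    """Check a newly arrived email against previously reported spam."""
    return signature(body) in spam_signatures


report_spam("Buy  CHEAP pills\nnow")
print(is_known_spam("buy cheap pills now"))                # prints True
print(is_known_spam("buy cheap pills now zebra lantern"))  # prints False
```

The second check shows the spam salad problem: two extra random words change the hash completely, so a practical signature has to be far more tolerant of small differences than this one.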
Rule-based spam filters
- Rules about headers, structure, content, format, e.g.:
- if From header starts with many numbers then email is spam (2.302)
- if body attempts to disguise the word 'viagra' then email is spam (2.203)
- if body includes HTML then email is spam (0.001)
- Scores of the rules that match are combined for an overall judgment
- Problem:
- Spam keeps changing so we have to keep adding rules
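The scoring scheme can be sketched as follows, reusing the three example scores above. Everything else is an assumption for illustration: the regular expressions, the reading of "many numbers" as four or more digits, the simple sum used to combine scores, and the threshold of 4.0.

```python
import re

# Matches 'viagra' with optional punctuation/spaces between letters and a few
# common character swaps. It catches only some disguises (not e.g. 'Vaigra').
DISGUISED = re.compile(r"v\W*[i1!|]\W*[a@]\W*g\W*r\W*[a@]", re.IGNORECASE)


def from_starts_with_digits(msg):
    return re.match(r"\d{4,}", msg["from"]) is not None


def disguises_viagra(msg):
    # A match that is not the plain spelling counts as an attempt to disguise it.
    return any(m.group().lower() != "viagra" for m in DISGUISED.finditer(msg["body"]))


def contains_html(msg):
    return "<html" in msg["body"].lower()


# (rule, score) pairs; the scores are the ones quoted in the bullets above.
RULES = [
    (from_starts_with_digits, 2.302),
    (disguises_viagra, 2.203),
    (contains_html, 0.001),
]


def spam_score(msg):
    """Add up the scores of all rules that match."""
    return sum(score for rule, score in RULES if rule(msg))


def is_spam(msg, threshold=4.0):  # hypothetical threshold
    return spam_score(msg) >= threshold
```

Each new spammer trick needs another entry in `RULES`, which is the maintenance problem noted above.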
Bayesian spam filters
- A technique from Artificial Intelligence!
- From a set of spam and a set of ham, compute conditional probabilities, e.g.
- Prob(spam | viagra) = 0.95, Prob(ham | viagra) = 0.05
- Prob(spam | satisfy) = 0.6, Prob(ham | satisfy) = 0.4
- When a new email arrives, compute the probability that it is spam and the
probability that it is ham by combining the conditional probabilities using Bayes' law
- Also can update the conditional probabilities from the new emails: copes with change
- Problem:
- spam salad may throw things off track a little
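One common way to combine the per-word probabilities is shown below, using the example figures above. It is a sketch under two stated assumptions that real filters may not make: spam and ham are equally likely beforehand, and words occur independently of one another. The table and function names are invented for the example.

```python
# Prob(spam | word), taken from the examples above.
P_SPAM_GIVEN_WORD = {"viagra": 0.95, "satisfy": 0.6}


def prob_spam(words, table=P_SPAM_GIVEN_WORD):
    """Combine per-word spam probabilities, assuming equal priors and independent words."""
    spam_product = 1.0
    ham_product = 1.0
    for word in words:
        if word in table:  # words never seen in training are ignored
            spam_product *= table[word]
            ham_product *= 1.0 - table[word]
    return spam_product / (spam_product + ham_product)


print(round(prob_spam(["viagra", "satisfy"]), 3))  # prints 0.966
print(prob_spam(["hello"]))                        # prints 0.5
```

For the first email the calculation is (0.95 × 0.6) / (0.95 × 0.6 + 0.05 × 0.4) = 0.57 / 0.59. An email with no known words comes out at 0.5, i.e. no evidence either way.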
Case study: SpamAssassin
- SpamAssassin is free and open-source:
http://spamassassin.apache.org/
- Like all good commercial filters, it combines many techniques:
- A form of challenge-response (SPF)
- Blacklists
- Collaborative (Vipul's Razor)
- Manual rules
- Bayesian probabilities on the rules
- It uses another technique from Artificial Intelligence to determine the scores