smj's notes from the 2004 MIT Spam Conference
This summary Copyright 2004 Steven M Jones, all rights reserved.

Date: Mon, 19 Jan 2004
From: "Steven M. Jones"
Subject: Notes from MIT Spam Conference

I was fortunate enough to attend the 2nd Spam Conference over at MIT on Friday the 16th of January. The conference itself is extremely informal, with only a few individuals coordinating it - no proceedings, no t-shirts ;^) The web site (www.spamconference.org) has a list of the speakers and titles, but no links to the presentations. I've sent a note off asking about links to the slides, and expect to hear something back in a week or so (as of 21-Jan-2004).

In the meantime, you can check out the webcast here.

Please remember, all this is from a single day. Seven pages of notes, and I was assuming I'd be able to get the slides later so I was terse! There's another conference coming up in July in Mountain View, see www.ceas.cc for details.

Terry Sullivan
The Myth of Spam Volatility
Other papers

Terry's work involved statistical analysis of the features that make spam what it is. The summary was that these features are not the quickly morphing Terminator 2-like tentacles everybody seems to assume. He took ten calendar quarters of spam and used principal components analysis (PCA or PC) to try and see how the features present changed over time.

The resulting graphs look remarkably like the punctuated equilibrium of evolutionary models -- some change, stability for a while, some more change, rinse, repeat. (I'm not clear how much of that might result from the quarter or month sized buckets being used, I don't grok PCA.) Aside from volatility, the key point was that most of these features don't change much over time, and that there's a core "spamminess" that is very stable. The payoff? If you distill those stable features you can have a milter that cuts out 1/3 to 1/2 of all spam at the gateway in a very computationally cheap manner, and avoid the performance hit of Bayesian or other filters for that share of the traffic.

The presentation was nice in that it may formalize what previously has been an anecdotal notion. But, if you think about it, this is what most of SpamAssassin is already... Or I'm not thinking about it clearly.

Shlomo Hershkop, Columbia University
www1.cs.columbia.edu/ids/publications/EMT-ACNS03.pdf

Shlomo was interested in applying some tenets of machine learning and perhaps data mining to modeling email flows, with an eye towards identifying spam. There's an Email Mining Toolkit (EMT) available somewhere at Columbia that one can get hold of and play with.

The idea seemed to be to build a model of all email being sent amongst a group. By seeing these associations you build a whitelist of who to let through without a lot of analysis. Or something. Hey, at least I found a link for this one...
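
A minimal sketch of that whitelist idea in Python, assuming a mail log reduced to (sender, recipient) pairs. The function name build_whitelist, the log layout and the min_exchanges threshold are all hypothetical; none of this is taken from the Columbia toolkit.

```python
from collections import Counter

def build_whitelist(log, local_users, min_exchanges=2):
    """Return outside addresses that local users have written to at least
    min_exchanges times. `log` is an iterable of (sender, recipient) pairs."""
    outbound = Counter()
    for sender, recipient in log:
        # Only mail sent *by* a local user counts as evidence of a relationship.
        if sender in local_users and recipient not in local_users:
            outbound[recipient] += 1
    return {addr for addr, n in outbound.items() if n >= min_exchanges}

# Hypothetical log: alice writes to bob twice, to carol once.
log = [
    ("alice@example.org", "bob@partner.example"),
    ("alice@example.org", "bob@partner.example"),
    ("alice@example.org", "carol@other.example"),
    ("spammer@bulk.example", "alice@example.org"),
]
whitelist = build_whitelist(log, {"alice@example.org"})
```

Mail from anyone in the resulting set could skip the expensive analysis; everything else still goes through the normal filters.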

Jon Praed, Internet Law Group
"I sue spammers." (thunderous applause)

With recent legislation it's getting easier to prosecute these people once you identify them. If we give the lawyers the technical means to track spammers down, which is still most of the work, the lawyers can bleed them dry in the courts and put them out of business. The spammers all seem to be in touch with each other, buying and selling lists and contracting mail drops - it's a limited community, and they know when one of their own is taken down.

Biggest boon of the CAN SPAM act, even if it's useless as a law, is that it debunks the spammers' old assertion that the legal status of spam as protected speech protects what they do.

CAN SPAM does *not* require that ISPs carry compliant spam! It will drive spammers off-shore, and thus raise the demand for off-shore IP address blocks.

Geoff Hulten, Microsoft Research

Described work Microsoft Research has done with Hotmail, in particular the Hotmail Feedback Loop, where some large number of Hotmail users volunteered and are asked to grade one message that they received each day as spam/ham. Pointed out that this is a very large, very diverse group scattered over the globe, not just the US. This just highlights the fact that so far English has been both the language of spam and the language of all filtering, and that this will change. (I see some non-Latin charset spam, but no stats...) The corpus of messages they've collected is now over 10MM (good and bad? bad only?).

Geoff claims they haven't tried to tie any dollar figures to the spam they're blocking. In fact, there were several questions relating to rates and exact figures that he seemed to shy away from... (shrug)

Bill Yerazunis, Mitsubishi Electric Research Labs

Bill's focus was on improving detection beyond "three nines" (99.9%) and how to get there. He saw some interesting results using - I'll probably get this wrong - a Markov weighting of spam phrases. I didn't get the details, this guy was going a mile a minute. Something about an exponential weighting of the detected phrases, like 2**2N, but I didn't write down how it's applied.

More interesting was the idea of inoculation, where one user/site that detects a new form of spam not caught by their filters could send out a copy of the new strain to a larger group. A standard format for these inoculations would allow both large and small groups of cooperating sites or users to learn from each other. Bill is co-author of an Internet Draft on the format of inoculation messages with Jonathan Zdziarski, you can find the draft here.

Bill also pointed out that we should apply the hash/sig filter again at inbox access time, to allow the inoculation to propagate. I hadn't heard of anybody doing that.
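
To make the "check again at access time" point concrete, here is a rough Python sketch. It does not follow the Internet Draft's message format; the Mailbox class, the SHA-256 signature over normalised text, and the method names are assumptions made up for illustration.

```python
import hashlib

def signature(body):
    # Hash of the lower-cased, whitespace-normalised body (an assumed scheme).
    normalised = " ".join(body.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

class Mailbox:
    def __init__(self):
        self.known_spam = set()  # signatures learned locally or from peers
        self.inbox = []

    def deliver(self, body):
        # First pass: filter at delivery time against what we know so far.
        if signature(body) not in self.known_spam:
            self.inbox.append(body)

    def inoculate(self, body):
        # A cooperating user/site reports a new strain.
        self.known_spam.add(signature(body))

    def read_inbox(self):
        # Second pass: re-check at access time, so an inoculation that
        # arrived after delivery still removes the message.
        self.inbox = [m for m in self.inbox
                      if signature(m) not in self.known_spam]
        return list(self.inbox)
```

A message that slipped through at delivery is dropped on the next read_inbox() call if a matching inoculation has arrived in between.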

Matthew Prince, www.unspam.com
"I don't sue spammers... but I teach other lawyers how to sue spammers." (thunderous applause)

We'll never get opt-in, because it imposes prior restraints on free speech, which doesn't go over too well Constitutionally. Noted that all the existing anti-spam laws were crafted or are based on laws written before spam really hit the steep part of its growth curve in 2000-2001. But we can get around this.

Best thing in CAN SPAM: McCain amendment, which says you can go after the entity being advertised/promoted as well as, or instead of, the spammer.

Had these suggestions for us techies:

  1. Establish identity and jurisdiction -- of you as recipient, if not of the spammer. Referred to Washington registry registry.waisp.org as an example, so that spammers are then on notice of terms they need to adhere to.
  2. Move/attack earlier in the spam lifecycle:
    1. address harvesting
    2. list sold
    3. contract to send
    4. spamming
  3. Claim your rights. Your set of corporate email addresses are a trade secret -- invoke the DMCA! Put a "no trespass" sign on your web pages, adding a prohibition against address harvesting. It's a contract, you'll establish liability under those terms. (Sounds as pernicious as EULA...)

Marty Lamb, Martian Software
www.martiansoftware.com
Slides

TarProxy tries to detect spam and make the delivery effort hurt the spammer, like spamd under OpenBSD. So it does things like hold the connection open if something smells like spam, then temp fails it. Problem is you tie up your own resources, but OTOH front-end TarProxy boxes can be very cheap... (?)

One neat idea, TarProxy allows for chains of classifiers to be invoked serially, which can each add their own headers. (Actually, DSPAM uses something like this, possibly others.)
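
A small Python sketch of the chaining idea, not of TarProxy's actual interfaces (TarProxy itself is Java, and I don't have its API in front of me). The two classifiers, the header names and the message layout are invented for illustration.

```python
def url_classifier(message):
    # Flags messages whose body contains a plain http:// link.
    if "http://" in message["body"]:
        return ("X-Has-URL", "yes")
    return None

def shouting_classifier(message):
    # Flags subjects written entirely in capitals.
    if message["headers"].get("Subject", "").isupper():
        return ("X-Shouting-Subject", "yes")
    return None

def run_chain(message, classifiers):
    """Invoke each classifier in order; each may add one header."""
    for classify in classifiers:
        result = classify(message)
        if result is not None:
            name, value = result
            message["headers"][name] = value
    return message
```

Because each stage only appends a header, a later stage (or the final delivery agent) can make its decision from the accumulated verdicts.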

Ken Schneider, Brightmail

Brightmail uses their on-site machinery for ISP/corporate customers to track which filters/rules are still effectively being matched. This gets fed back in so they can trim down the size of the filter set.

Three main types of filters, based on: source, content, and the "call to action" e.g. URL. Over 90% of spam has a URL, that's how they make the sale, so it's a great feature to select on.
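
As an illustration of selecting on the "call to action", here is a short Python sketch that pulls the hosts out of any URLs in a message body. This is not Brightmail's method; the regular expression and the function name call_to_action_hosts are assumptions, and real spam would need far more careful parsing.

```python
import re
from urllib.parse import urlsplit

# Deliberately crude: a URL runs until whitespace, a quote or an angle bracket.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def call_to_action_hosts(body):
    """Return the set of lower-cased hostnames linked from the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname  # None if the URL has no host part
        if host:
            hosts.add(host)
    return hosts
```

The resulting hostnames could then be matched against a list of known spam-advertised sites.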

Jonathan Zdziarski, DSPAM
http://www.nuclearelephant.com/projects/dspam

"Advanced Language Classification using Chained Tokens" I basically missed this presentation, the first one after lunch. But the web site is quite extensive...

Jonathan has co-authored an Internet Draft for the format of "inoculation" messages with Bill Yerazunis, it's here.

Miles Libbey, Yahoo!

This gent is some sort of spam czar at Yahoo! The big trends in spam during 2003 have been in hiding and avoiding detection. Subverting PCs and using the zombies to spam. Inserting random "hash busters" into the body to either avoid statistical detection or screw up message hash identifiers a la Razor. Avoiding tokenization within spam filters by using e.g. UTF-8, base64, HTML entity encodings; also white-on-white HTML text displayed in place of whitespace to confuse parsers.
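
The obvious counter to the encoding tricks is to normalise before tokenizing. A minimal Python sketch, assuming the caller already knows the Content-Transfer-Encoding; the function names are made up, the tag stripping is crude, and it does nothing about white-on-white text.

```python
import base64
import html
import re

def normalise(body, transfer_encoding=""):
    """Undo base64 and HTML entity encoding, then drop HTML tags/comments."""
    if transfer_encoding.lower() == "base64":
        body = base64.b64decode(body).decode("utf-8", errors="replace")
    body = html.unescape(body)            # e.g. "&#103;" becomes "g"
    body = re.sub(r"<[^>]*>", "", body)   # crude: removes tags and comments
    return body

def tokens(body, transfer_encoding=""):
    """Lower-cased word tokens of the normalised body."""
    text = normalise(body, transfer_encoding).lower()
    return re.findall(r"[a-z0-9$']+", text)
```

With this in front of the filter, a word split by an HTML comment or spelled with entities comes out as the same token the filter was trained on.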

Don't remember him saying much about the Domain Keys proposal, but here's a link: http://www.infoworld.com/article/03/12/05/HNyahoospam_1.html I haven't seen any details yet, but someone in-house should have it.

Eric Kidd, Dartmouth Medical School
eric . kidd (at) pobox . com

The idea is basically to invert the usual application of a Bayesian filter and use it to identify non-spam. The features of non-spam should be more stable than spam, so this might yield a useful whitelist to apply before trying to figure out if incoming messages are spam. This has something like a network effect -- his boss cc'd a domain in a note to him, and that domain was then whitelisted for quite a while afterwards.
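
A toy Python version of the inverted filter, to show the shape of the idea rather than Eric's implementation. The HamScorer class, the add-one smoothing, the equal priors and the idea of treating a sender's domain as just another token are my assumptions.

```python
import math
from collections import Counter

class HamScorer:
    def __init__(self):
        self.ham = Counter()    # messages containing each token, ham side
        self.spam = Counter()   # same, spam side
        self.n_ham = 0
        self.n_spam = 0

    def train(self, tokens, is_ham):
        (self.ham if is_ham else self.spam).update(set(tokens))
        if is_ham:
            self.n_ham += 1
        else:
            self.n_spam += 1

    def ham_probability(self, tokens):
        # Naive Bayes with add-one smoothing and equal priors,
        # accumulated as log-odds in favour of ham.
        log_odds = 0.0
        for tok in set(tokens):
            p_ham = (self.ham[tok] + 1) / (self.n_ham + 2)
            p_spam = (self.spam[tok] + 1) / (self.n_spam + 2)
            log_odds += math.log(p_ham / p_spam)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Messages scoring above some high threshold would be whitelisted outright, and only the remainder handed to the spam filter proper.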

http://www.randomhacks.net/stories/bayesian-whitelisting.html is his original article, I think, from 2002.

Victor Mendez, Universidad Carlos III de Madrid

Following up on an idea from Brian Burton / SpamProbe, Victor and company took a CRM114 compiler and started changing window size and number of features used to tune the speed of their classifying engine. Turned out that decreasing the number of features lowered the number of false positives (which should probably be studied further...).

John Graham-Cumming, Sophos
How to Beat a Bayesian Spam Filter

This was the only talk to be "phoned in" -- the presentation was videotaped and shown on the big screen. Included an animated sequence where the cartoon announced "All your email are belong to us!"

The basic idea was that, similar to cryptanalysis, if someone can establish a feedback mechanism for things getting through your Bayesian spam filter, then they can start altering their plaintext/message to see if it makes it through the filter. The feedback mechanism suggested was web bugs, since spammers are already known to use them. By establishing which messages get through, the spammer can use the feedback loop to identify "magic words" that cause their messages to be identified as ham rather than spam.

He also went on a bit about forged DSN/NDR's and how spammers are attacking challenge/response by styling their spam as a bogus c/r message with links to their sales pitch. But the feedback loop topic is covered almost exactly as he presented it at this link: http://www.usethesource.com/articles/03/05/19/1248213.shtml.

Thede Loder, Marshall Van Alstyne, Rick Wash - University of Michigan
http://www.citi.umich.edu/u/rwash/projects/spam

Basically, another proposal that we use a free-market system to cause spam to self-regulate. Let the advertiser risk some money with each message; if the recipient doesn't want the message, they keep the money. Imposes real cost on the sender, still lets them reach anybody who wants the message.

Very academic solution. The only people this would benefit are the DMMA, whose members are not (yet) spamming. Maybe the DMMA would actually be able to set up and run the missing infrastructure for this... Can't think of any other party who would benefit from doing it.

Eric Johansson, CAMRAM
http://www.camram.org/

An idea of using crypto-stamps (based on N-bit hash collisions) as part of a hybrid sender-pays system. Eric presented some ideas that lead to an incrementally deployed system that will grow into a nice "network effect" and speed adoption.
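
For anyone who hasn't seen this kind of stamp, a rough hashcash-style sketch in Python. It is not CAMRAM's stamp format: the "recipient:counter" layout, the use of SHA-1 and the 12-bit default are assumptions, and a real system would also need a date field and a record of stamps already spent.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count the zero bits at the front of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(recipient, bits=12):
    """Search for a counter whose stamp hashes to `bits` leading zero bits.
    Expected work is about 2**bits hashes; this is the sender's cost."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp, recipient, bits=12):
    """One hash for the receiver: cheap to check, costly to produce."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits
```

The asymmetry is the whole point: minting costs thousands of hashes per recipient, verifying costs one.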

Peter Kay, TitanKey
www.titankey.com/mit

Basically a database, web interface, and policy enforcement module that allows users to set up multiple email addresses and change the policy on them as needed. For instance, create an address for the spam conference, give it out as needed. People use it. After a while spammers discover it, so the user changes the policy such that only past senders are allowed to use it henceforward. Change it back later, every day, as you like via web interface.

A very... dynamic... presentation. Policy is enforced based on envelope addressing, so the receiving MTA can reject before any data has been accepted. Policies can be very rich/complex. Allows for all kinds of parental and/or corporate control.
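
A bare-bones Python sketch of a per-address policy checked at RCPT time, before any message data is accepted. The AddressPolicy class, its three modes and the reply strings are illustrative assumptions, not TitanKey's actual design.

```python
class AddressPolicy:
    OPEN = "open"                       # anyone may write
    KNOWN_ONLY = "known-senders-only"   # only senders seen before
    CLOSED = "closed"                   # address retired

    def __init__(self):
        self.mode = self.OPEN
        self.past_senders = set()

    def rcpt_decision(self, envelope_sender):
        """Return an SMTP reply for RCPT TO, based on the envelope sender."""
        sender = envelope_sender.lower()
        if self.mode == self.CLOSED:
            return "550 address no longer accepts mail"
        if self.mode == self.KNOWN_ONLY and sender not in self.past_senders:
            return "550 sender not permitted for this address"
        self.past_senders.add(sender)   # remember everyone we accepted
        return "250 OK"
```

Flipping mode on an address is the whole user-visible operation; the MTA just asks rcpt_decision() for each envelope.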

Eric S Raymond

CAN SPAM section 11.2 says the Feds should consult the IETF about spam labeling standards. But the IETF doesn't have any. ESR will soon post a draft, building on an earlier "MTA signposting" RFC by Carl Malamud.

ESR then went on at length about how SPF is a good idea. I still haven't read the handout. Not everyone in the crowd agreed with this conclusion, by the by...

Tonny Yu, mailshell

I didn't really follow very much of this presentation. Sorry.

Richard Jowsey, Death2Spam.net http://www.death2spam.net

This energetic and extremely tall New Zealander had at least a half-day of material on server/corporate-wide Bayesian filters, how to tune them, and how to adjust for feature/keyword drift over time. Jowsey once worked as an audio engineer of some sort, and had lots of interesting ways of looking at the mapping of features to spam/ham in terms of signal analysis.