Machine Learning: Filtering Email for Spam or Ham

You may have seen our previous posts on machine learning, specifically how to let your code learn from text and working with stop words, stemming, and spam. Today, we’re going to build a machine-learning-based spam filter using the tools we walked through in those posts: a tokenizer, a stemmer, and a naive Bayes classifier.

We are going to work with the bluebird promise library here, so if you are not used to promises, please take a look at the bluebird API reference.

Training and Testing Dataset

Before we begin, it’s important to have good training data. You can download some here; we are interested in two files.

  • TR-mails.zip, the raw emails’ corpus
  • spam-mail.tr, the correct labels (spam or ham) associated with each training email in TR-mails.zip, where each line gives the correct label for the specified email id.

1,0 means email TRAIN_1.eml is spam
3,1 means email TRAIN_3.eml is ham (not spam)

Now we need to write some code that loads the labels file and trains the classifier using the raw training emails.

const { readFile } = require('fs');
const { promisify } = require('bluebird');
const readFileAsync = promisify(readFile);

getLabelsForIds("data/spam-mail.tr.label")
.then((emails) => {
  console.log(emails);
  process.exit(0);
})
.catch((error) => {
  console.error(error);
  process.exit(1);
});

This code loads the labels from spam-mail.tr as an array of objects, each with this shape:

{id: email_id, label: "spam"/"ham"}

$ node --harmony train.js
>> [ { id: '1', label: 'spam' },
  { id: '2', label: 'spam' },
  { id: '3', label: 'ham' },
  { id: '4', label: 'spam' },
  … ]
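
The getLabelsForIds() function is called above, but its body is not shown. A minimal sketch might look like the following, assuming each data line of the labels file has the form id,value with 0 for spam and 1 for ham (as in the examples above), and that any line not matching that pattern can be skipped. The parseLabels() helper and the LABELS table are hypothetical names, and Node’s built-in fs.promises stands in for the promisified readFile.

```javascript
const { readFile } = require('fs').promises;

// Assumed mapping, taken from the examples above: 0 is spam, 1 is ham
const LABELS = { 0: 'spam', 1: 'ham' };

// Hypothetical helper: turn the raw file contents into [{ id, label }, ...]
function parseLabels(contents) {
  return contents
    .split(/\r?\n/)
    .map((line) => line.trim().match(/^(\d+),([01])$/))
    // lines that do not look like "id,value" (blank lines, headers) are dropped
    .filter((match) => match !== null)
    .map(([, id, value]) => ({ id, label: LABELS[value] }));
}

// Read the labels file and resolve with the parsed array
function getLabelsForIds(path) {
  return readFile(path, 'utf8').then(parseLabels);
}
```

Keeping the parsing separate from the file read makes it easy to check the mapping on a short string before pointing it at the real file.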

Now that we have our ids and labels, we need to load the email corpus.

Raw Email Files

The files TRAIN_id.eml are raw email files. They include all the information about the email: from, cc, to, subject, body, etc.

To train the classifier, we need the clean sender address, subject, and body (text/HTML). To parse these .eml files, we can use the mailparser module.

const { createReadStream } = require('fs');
const { MailParser } = require('mailparser');
function parseEmail(emailPath) {
  return new Promise((resolve, reject) => {
    // the mail parser
    const parser = new MailParser();
    // opening email
    const emailStream = createReadStream(emailPath);
    // error opening the email
    emailStream.on('error', reject);
    // finished parsing. mail is an email object
    parser.on('end', resolve);
    // piping the email stream into the parser
    emailStream.pipe(parser);
  });
}

The function returns a promise we can now use to get the parsed email.

parseEmail(`data/TR/TRAIN_1.eml`)
.then((mail) => {
  console.log(`${mail.from[0].name} <${mail.from[0].address}>: ${mail.subject}`);
});

>> Ben Green <bengreen@mindupmerchants.com>: One of a kind Money maker! Try it for free!

What we need are the properties I’ve mentioned above: from, subject, text, and html.

Sometimes the email is plain text, and other times it is HTML. When the email is plain text, the html property is undefined; when it is formatted as HTML, the text property is undefined. In the training email with id 1, the body is HTML, so it is available on the html property and the text property is undefined.

HTML Emails

Sometimes it’s better to get the pure text from the HTML, rather than use the nested HTML, as training data.

Let’s use the unfluff module to extract pure text from the HTML. Adding unfluff to the email parser is simple.

const unfluff = require('unfluff');

parser.on('end', (mail) => {
  if (mail.html) {
    // use the extracted text, falling back to the raw HTML if none was found
    const result = unfluff(mail.html);
    mail.text = result.text || mail.html;
  }
  resolve(mail);
});

Now the mail object should always have a text property.
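
The fallback logic above can be pulled into a small function so it is easy to check in isolation. This is only a sketch: ensureText() is a hypothetical name, and the extractor is passed in as a parameter (in the parser it would be unfluff) so the example runs without any dependencies. The stripTags() stub is a deliberately crude stand-in, not how unfluff works.

```javascript
// Hypothetical helper: make sure a parsed mail object has a text property.
// extract is any function that takes an HTML string and returns { text }.
function ensureText(mail, extract) {
  if (mail.html) {
    const result = extract(mail.html);
    // fall back to the raw HTML when the extractor finds no text
    mail.text = result.text || mail.html;
  }
  return mail;
}

// Stub extractor that strips tags, used only for this demonstration
const stripTags = (html) => ({
  text: html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim(),
});

const htmlMail = ensureText({ html: '<p>Try it <b>for free</b>!</p>' }, stripTags);
const textMail = ensureText({ text: 'See you at lunch' }, stripTags);
console.log(htmlMail.text);
console.log(textMail.text);
```

A plain-text email passes through untouched, while an HTML email whose extraction comes back empty keeps its raw markup as text.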

Train

Let’s now combine all the concepts we’ve seen so far. Using promises, it is quite easy to chain these different steps:

  1. Load train email ids and labels
  2. Parse the first email
  3. Tokenize and stem the text (prepending sender address and email subject)
  4. Train the classifier

// 1st step
getLabelsForIds('data/spam-mail.tr.label')
.then((emails) => {
  // save the first email id and label
  const [{ id, label }] = emails;

  // 2nd step
  return parseEmail(`data/TR/TRAIN_${id}.eml`)
  .then((mail) => {
    // prepend the sender address and the subject to the text
    // so the classifier has all the useful information it needs
    const text = `${mail.from[0].address} ${mail.subject} ${mail.text}`;

    // tokenizeAndStem() returns an array of stemmed tokens
    // join them to have a string
    const trainText = text.tokenizeAndStem().join(' ');

    // train the classifier with text and label
    return cls.train(trainText, label);
  });
});
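
Note that text.tokenizeAndStem() is not a built-in String method. The snippet assumes String.prototype was extended as in the earlier posts (the natural module’s PorterStemmer.attach() adds a method with this name, if that is the stemmer in use). To illustrate what the training text looks like without that dependency, here is a rough sketch with a crude stand-in tokenizer: it lowercases, splits on non-alphanumeric characters, drops a few stop words, and strips a trailing "s", which is far simpler than a real Porter stemmer. The names buildTrainText(), crudeTokenizeAndStem(), and STOP_WORDS are hypothetical, and the sample mail is made up.

```javascript
// A tiny, illustrative stop-word list (not the one a real library uses)
const STOP_WORDS = new Set(['a', 'an', 'the', 'of', 'it', 'for', 'to', 'is']);

// Crude stand-in for a real tokenizer and stemmer
function crudeTokenizeAndStem(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !STOP_WORDS.has(token))
    // naive "stemming": strip one trailing "s" from longer words
    .map((token) => (token.length > 3 && token.endsWith('s') ? token.slice(0, -1) : token));
}

// Build the string passed to the classifier: sender address, subject, then body
function buildTrainText(mail, tokenizeAndStem = crudeTokenizeAndStem) {
  const text = `${mail.from[0].address} ${mail.subject} ${mail.text}`;
  return tokenizeAndStem(text).join(' ');
}

// Made-up sample mail, for illustration only
const sample = {
  from: [{ address: 'ben@example.com' }],
  subject: 'Money makers',
  text: 'Try it for free!',
};
console.log(buildTrainText(sample));
```

Because the same text preparation is needed for both training and classification, a helper like this also keeps the two code paths from drifting apart.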

Chaining the Promises

We now want to automate the training process by running through all the training emails. To do that, we need to chain the promises, one per email.

First, let’s create a function that trains the classifier on a single email, taking the mail id, label, and classifier as arguments.

function trainEmail(id, label, cls) {
  return parseEmail(`data/TR/TRAIN_${id}.eml`)
  .then((mail) => {
    const text = `${mail.from[0].address} ${mail.subject} ${mail.text}`;
    const trainText = text.tokenizeAndStem().join(' ');
    return cls.train(trainText, label);
  });
}

Here, data/TR/ is the directory that holds the training emails.

The function returns a promise for each email it trains on. Let’s iterate through the emails one at a time using bluebird’s each() method.

const { each } = require('bluebird');
function trainAllEmails(emails, cls) {
  return each(emails, (email) => {
    const { id, label } = email;
    return trainEmail(id, label, cls);
  });
}
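
bluebird’s each() processes the array serially: it waits for the promise returned for one item before moving on to the next, so only one email is being parsed at a time. For comparison, a roughly equivalent sketch with native promises is shown below. eachInSeries() is a hypothetical name, and fakeTrain() stands in for trainEmail() so the example runs on its own.

```javascript
// Hypothetical native-promise equivalent of a serial each():
// run fn on every item, one after another, and resolve with the original array
async function eachInSeries(items, fn) {
  for (const item of items) {
    // wait for this item to finish before starting the next one
    await fn(item);
  }
  return items;
}

// Stand-in for trainEmail(): records start and end order after a short delay
const log = [];
function fakeTrain({ id }) {
  log.push(`start ${id}`);
  return new Promise((resolve) => setTimeout(resolve, 5)).then(() => {
    log.push(`end ${id}`);
  });
}

const done = eachInSeries([{ id: '1' }, { id: '2' }], fakeTrain).then(() => {
  console.log(log.join(', '));
  return log;
});
```

The log shows each email finishing before the next one starts, which is the behavior the training loop relies on.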

Okay, we are finally here. With this last function, we train the classifier on all the training emails we have. To know when it has finished, we can append another then() to it:

const { Bayesian } = require('classifier');
const cls = new Bayesian();
getLabelsForIds('data/spam-mail.tr.label')
.then((emails) => {
  return trainAllEmails(emails, cls);
})
.then(() => {
  console.log('training finished');
});

Classify Test Emails

The site where we downloaded the training data also provides some test data — a mix of spam and ham emails with no labels.

The file is TT-mails.zip, and I’m placing the TEST_id.eml emails in the data/TT/ directory.

We’ve seen all we need to easily classify a test email. We can reuse the parseEmail() function to build our new classifyEmail() function. In the end, we just need the parsed sender address, subject, and text to pass to our classifier, which returns the label.

function classifyEmail(id, cls) {
  return parseEmail(`data/TT/TEST_${id}.eml`)
  .then((mail) => {
    const text = `${mail.from[0].address} ${mail.subject} ${mail.text}`;
    const classifyText = text.tokenizeAndStem().join(' ');
    return cls.classify(classifyText);
  });
}

We can now use the classifyEmail() function after the training to classify the first test email.

.then(() => {
  console.log('training finished');
  return classifyEmail(1, cls);
})
.then((label) => {
  console.log(`TEST_1 is ${label}`);
});

Pay attention: The classifier must be the same cls instance we trained!

>> training finished
>> TEST_1 is ham

It definitely looks like ham!

Now let’s create classifyEmailsRange(), which classifies count emails starting from the email id start.

const { map } = require('bluebird');

// Create an array of n consecutive integers starting at x: [x, x + 1, ..., x + n - 1]
function range(n, x = 1) {
  return Array.from(Array(n), (v, k) => k + x);
}

// Classify count emails, with ids from start to start + count - 1
function classifyEmailsRange(cls, start, count) {
  return map(range(count, start), (id) => {
    return classifyEmail(id, cls);
  });
}
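
As a quick check of the helper, range(15, 1) should yield the ids 1 through 15. Unlike each(), bluebird’s map() does not wait for one item to finish before starting the next, so the emails may be parsed concurrently, while the resolved array keeps the input order. The sketch below shows the same idea with the native Promise.all; classifyRange() and fakeClassify() are hypothetical stand-ins so the example runs without the email files.

```javascript
// n consecutive integers starting at x
function range(n, x = 1) {
  return Array.from(Array(n), (v, k) => k + x);
}

// Hypothetical native version: start all classifications, keep input order
function classifyRange(classifyOne, start, count) {
  return Promise.all(range(count, start).map((id) => classifyOne(id)));
}

// Stand-in classifier: pretends every even id is spam
const fakeClassify = (id) => Promise.resolve(id % 2 === 0 ? 'spam' : 'ham');

console.log(range(3, 5));
const labels = classifyRange(fakeClassify, 1, 4);
labels.then((results) => console.log(results));
```

If opening many files at once becomes a problem, bluebird’s map() also accepts an options object with a concurrency limit as its third argument.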

To get the labels for test emails 1 through 15, we run this code once the classifier is trained.

.then(() => {
  console.log('training finished');
  return classifyEmailsRange(cls, 1, 15);
})
.then((results) => {
  console.log(results);
});

And we get:

Accuracy and Some Considerations

Take a look at the emails. We have decent accuracy, but the result is not 100% correct.

Test emails 4, 7, and 14 look quite spammy to me. So we have an accuracy around 70% over 14 emails, which shows you how powerful these tools are, but I would not sell it as spam filter software. The cool thing about this classifier is that it can be trained, so we could say to it “emails 4, 7 and 14 are spam” by using them in our training set.
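
Feeding those corrections back could look roughly like this. It is a sketch under assumptions: retrainWithCorrections() is a hypothetical helper, loadText stands for whatever turns a test email id into tokenized text (for example, the parsing and stemming steps used in classifyEmail()), and the stub classifier below only records calls so the example runs without the real one.

```javascript
// Hypothetical helper: train the classifier on hand-labelled test emails
function retrainWithCorrections(cls, corrections, loadText) {
  // process the corrections one at a time, in order
  return corrections.reduce(
    (chain, { id, label }) =>
      chain.then(() => loadText(id)).then((text) => cls.train(text, label)),
    Promise.resolve()
  );
}

// Stub classifier and loader, for demonstration only
const trained = [];
const stubCls = { train: (text, label) => trained.push({ text, label }) };
const stubLoadText = (id) => Promise.resolve(`text of TEST_${id}`);

// The three emails judged to be spam by hand
const corrections = [4, 7, 14].map((id) => ({ id, label: 'spam' }));
const finished = retrainWithCorrections(stubCls, corrections, stubLoadText)
  .then(() => trained);
```

With the real classifier, cls would be the same trained instance used earlier, so the corrections add to what it has already learned.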

But before that, there is a lot more information we could use inside the raw email files. The from information is unreliable, unfortunately, because the sender can set it to anything. Fortunately, the received property is available in the headers, and it can be helpful in tracing spam. If you’d like to learn more, check out this article on how to use the received header and improve your filter’s accuracy.

Do you have experience working with machine learning? Be sure to let us know any tips and tricks you might have in the comments below!

Code School

Code School teaches web technologies in the comfort of your browser with video lessons, coding challenges, and screencasts. We strive to help you learn by doing.

Visit codeschool.com

About the Author

Alvise Susmel

Alvise Susmel is a Software Architect at a London-based hedge fund, working on a big-data platform that analyses and processes crucial investment data. He loves learning new technologies, sharing that passion, and building services like poetic.io.
