Making a Mad Decent Email Classifier

Dominick Caponi
Feb 6, 2021 · 7 min read

Not your father’s spam filter

big brain email reader [credit]

If you’re an engineer, you likely get a ton of email. Some of it is recruiter spam, GitHub notifications, or very important messages from the boss announcing a re-org that doesn’t affect you. Some of these emails require your direct attention, others are more of an FYI, and the rest make me question how good those spam filters actually are…

I subscribe to the goal of inbox zero: every email in my inbox has been assessed, and I’ve decided to either address it, read and consider it, or ignore it. Most of the information I need to make that decision can be gleaned from the subject line and the first few lines of the email. So rather than having my Gmail app ping me every time a new message arrives and derail my flow state, I decided to see if I could train a classifier to sift through a batch of those emails and tell me which ones merit my attention and which ones I can safely ignore without repercussions.

Contents

  1. Working with Gmail
  2. The Dark Art of Text Parsing
  3. Classifying Stuff
  4. Bringing it Together

Working with Gmail

I’m primarily a Gmail user, so that’s really the only reason I focused on it. Google provides an API for Gmail with decent documentation. I just wanted to grab all the unread messages in an inbox for training or classification and mark them as read as I went.

You’ll need to go to their API site and get a developer account. Then follow the steps to create an app (this is how we’ll authenticate our code to Gmail’s API via OAuth). Once you have that, this code will get you started with authorizing and pulling all the unread emails.

import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

def read_gmail():
    SCOPES = ['https://www.googleapis.com/auth/gmail.modify']
    unread_messages = []
    creds = None

    # Reuse cached credentials if we've authorized before
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            # First run: walk through the OAuth consent flow in a browser
            flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    # Call the Gmail API for everything labeled both INBOX and UNREAD
    service = build('gmail', 'v1', credentials=creds)
    inbox_unreads = service.users().messages().list(
        userId='me', labelIds=['INBOX', 'UNREAD']).execute()
    if inbox_unreads['resultSizeEstimate'] != 0:
        for unread in inbox_unreads['messages']:
            # Fetch the full message, then mark it as read
            unread_messages.append(
                service.users().messages().get(userId='me', id=unread['id']).execute())
            service.users().messages().modify(
                userId='me', id=unread['id'],
                body={"removeLabelIds": ["UNREAD"]}).execute()
    return unread_messages

One of the unfortunate things about the Gmail API is there isn’t support for batch update operations, so you’d have to go through and make a PUT request for each message to mark it as read. If I’m missing something here, let me know in the comments.

The Dark Art of Text Parsing

TL;DR — It’s hard, and it rests on some assumptions about how I read and assess emails.

Text analysis is still one of those areas that requires a fair amount of research to get right consistently. For example, your sentiment analyzer could see a 5-star product review that says “Not bad” or “mad decent” and a 1-star review that says “failed spectacularly” or “Amazing waste of money”. In sentiment analysis you have to figure out which words map to positive or negative sentiment.

In the email classifier, I had to find out which words made an email more or less important. Since there’s no way of knowing how every person analyzes and assesses importance of messages, I had to make a few assumptions first.

  1. Emails containing certain words may be consistently deemed more important than others.
  2. The words in a subject line are more impactful on determining the importance of an email than words in the body.
  3. The people contained in the email chain also add weight to how important an email is (if your boss is CCed it must be important right?).

Based on these assumptions, I decided to define a “document” (a row on a table of data for analysis) as the email subject + the email body + all participants in the “From” and “CC” fields.
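As a rough sketch of that definition, assuming the message objects returned by read_gmail() above, something like the following flattens one message into a document. build_document is my own hypothetical name, and I use the short snippet field rather than decoding the full MIME body:

def build_document(message):
    """Hypothetical helper: flatten one Gmail API message into a single text 'document'."""
    headers = {h['name'].lower(): h['value']
               for h in message['payload'].get('headers', [])}
    subject = headers.get('subject', '')
    participants = ' '.join([headers.get('from', ''), headers.get('cc', '')])
    # 'snippet' is a short plain-text preview of the body; decoding the full
    # MIME body from message['payload'] would slot in here the same way.
    body_preview = message.get('snippet', '')
    return ' '.join([subject, body_preview, participants])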

Once I had that definition, I applied a few common text-processing and noise-reduction techniques. To cut the noise introduced by stopwords (words like “the”, “a”, “at”, and so on), I ran the subject and body through the list of common stop words defined in the nltk library.

Finally, because I really only want the essence of each word, I ran the documents through the Porter stemmer, which makes words like Connect, Connected, Connecting, and Connector all look like connect, since the key concept is “to connect”.
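A minimal sketch of that cleanup step might look like this, assuming the nltk stopword and tokenizer data have already been downloaded:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Assumes nltk.download('stopwords') and nltk.download('punkt') have been run
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(document):
    """Lowercase, drop stop words, and stem everything that's left."""
    tokens = word_tokenize(document.lower())
    return ' '.join(stemmer.stem(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

clean_text("Connected and connecting")  # -> 'connect connect'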

Now That Text is Parsed, How do We Organize It?

TL;DR — use TF-IDF to create a bunch of features (weighted word counts) for each document (message).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
X_ngrams = tfidf_vect.fit_transform(df['text'])

train_messages, test_messages, train_labels, test_labels = train_test_split(
    X_ngrams, df['label'], test_size=0.3, random_state=42, stratify=df['label'])

A TF-IDF vectorizer basically builds a vocabulary over all the emails and assigns each word in each email a value based on how frequently it appears in that email and how frequently it appears across all the emails. TF is Term Frequency, or how many times a word appears in a given email. IDF is Inverse Document Frequency, which is a little more math-y, but what you need to know is that IDF measures how much information a word carries: words that show up in almost every email get a low weight. For instance, “important” may not provide much information since everything might contain the word “important”, but “terminated” gives a lot of information since, in your world, only EC2s or your employment might be “terminated”.
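As a quick sanity check on that intuition, here’s a toy example (not from the email data) showing that a word appearing in every document gets the minimum IDF weight, while a rarer word carries more:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "important meeting tomorrow",
    "important instance terminated",
    "important lunch plans",
]
vect = TfidfVectorizer()
vect.fit(toy_docs)

# 'important' appears in every document, so its IDF is the minimum possible;
# 'terminated' appears only once, so it carries more weight.
for word in ['important', 'terminated']:
    print(word, vect.idf_[vect.vocabulary_[word]])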

TF-IDF For Words in Einstein’s Works

We also use unigrams and bigrams (ngram_range=(1, 2)), which is a way of accounting for words that may only make sense together. With unigrams alone, each word is looked at in isolation, so in the sentence “You got a promotion and a stock bonus” each word is looked at without context and you’d see promotion, stock, bonus, which sorta sounds like you got stock AND a bonus. A bigram looks at each adjacent pair, so you’d also see promotion stock and stock bonus, which sounds like your bonus IS stock. N-grams are just another way to add contextual color to your analysis.
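To make that concrete, here’s what the vectorizer extracts from that sentence, shown as a toy example with stop words already stripped:

from sklearn.feature_extraction.text import TfidfVectorizer

sentence = ["got promotion stock bonus"]  # stop words already removed

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(sentence)
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(sentence)

# On older scikit-learn versions, the method is get_feature_names() instead
print(unigrams.get_feature_names_out())
# ['bonus' 'got' 'promotion' 'stock']
print(bigrams.get_feature_names_out())
# ['bonus' 'got' 'got promotion' 'promotion' 'promotion stock' 'stock' 'stock bonus']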

At the end of this, you should have a table of data where each column is an n-gram with its associated TF-IDF value, plus a label column. When first turning this on, I ingested 50 emails (select all in your inbox, mark them as unread, then ingest) and randomly labeled them from 0–2, where 0 was “don’t read”, 1 was “read, but just FYI”, and 2 was “you’re gonna need to respond somehow”.
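The confusion-matrix snippets later on refer to an action_map; mine is just that label scheme as a dict, something like the following (the exact key names here are my guess):

# My guess at the label scheme used later on; the exact key names are up to you
action_map = {"don't read": 0, "read, FYI": 1, "respond": 2}

# df['label'] holds these integer labels for each ingested email document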

If you’re familiar with machine learning workflows, the train_test_split bit should look familiar.

Classifying Stuff

TL;DR — Using Naive Bayes and Random Forest with 50 randomly labeled data points, I got each model to produce an F1 score of around 0.35, which isn’t terrible and improved as I went back, labeled the data properly, and added more data points.

For this approach I went with Multinomial Naïve Bayes and a Random Forest. I wanted something that handles categorical or non-continuous data well and supports multi-class (more than two) outputs. This isn’t quite a spam/not-spam filter, but luckily the old standbys still work here.

import pandas as pd
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
%time mnb.fit(train_messages, train_labels)
mnbpred = mnb.predict(test_messages)
print('Multinomial Naive Bayes F1 Score :', metrics.f1_score(test_labels, mnbpred, average='weighted'))

# Confusion matrix: actual vs. predicted counts for each class
pd.DataFrame(
    metrics.confusion_matrix(test_labels, mnbpred),
    index=[['actual', 'actual', 'actual'], list(action_map.keys())],
    columns=[['predicted', 'predicted', 'predicted'], list(action_map.keys())]
)

This was the typical MNB approach. I didn’t dig too deep into evaluating the model so I cheated and only considered the F1 score. I was also randomly flinging labels at my data as I was pre-processing it so the best I was able to do was 0.35. I’m sure if I took the time to gather more data and actually label it according to my behavior I’d do better.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}

# Exhaustively search the grid for the best hyper-parameters, in parallel
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, refit=True, verbose=3, n_jobs=6)
%time grid.fit(train_messages, train_labels)
pgpred = grid.predict(test_messages)
print('Random Forest F1 Score :', metrics.f1_score(test_labels, pgpred, average='weighted'))
print(grid.best_params_)

# Confusion matrix for the best estimator found by the grid search
pd.DataFrame(
    metrics.confusion_matrix(test_labels, pgpred),
    index=[['actual', 'actual', 'actual'], list(action_map.keys())],
    columns=[['predicted', 'predicted', 'predicted'], list(action_map.keys())]
)

This version runs the random forest classifier through a grid search to find the best hyper-parameters. Again, the best I was able to get was an F1 score of around 0.35, for what I suspect are the reasons I alluded to earlier.

Bringing it Together

There is still a fair bit to be done to make this production-ready. What I learned along the way is that it’s quite an uphill battle to find and label data in a way that makes sense, and to justify those decisions when coding a model.

For example, I was able to retrieve other signals from each message, such as the time it was sent, the number of messages in its thread, and the number of attachments, any of which might also be useful. I decided none of it was relevant, and I might be wrong about that.
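Just to illustrate what those extra signals look like, here’s a rough sketch of how they could be pulled from the same Gmail message objects (extra_features is a hypothetical helper, and the thread length costs an extra API call per message):

def extra_features(service, message):
    """Rough sketch of the extra signals I left out; field names follow the Gmail API message resource."""
    # internalDate is milliseconds since the epoch, as a string
    sent_at = int(message['internalDate']) / 1000

    # Count attachments: MIME parts that carry a filename
    parts = message.get('payload', {}).get('parts', [])
    num_attachments = sum(1 for p in parts if p.get('filename'))

    # Thread length requires a second call, to the threads endpoint
    thread = service.users().threads().get(userId='me', id=message['threadId']).execute()
    num_in_thread = len(thread['messages'])

    return sent_at, num_in_thread, num_attachments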

Also, when productionizing an ML model that is tuned to a particular user, there is some amount of onboarding friction. For me it came down to one of two paths: either accept that I have to manually label 100 emails myself for this to work well, or accept that it would be laughably bad for the first few weeks while it collects enough data to get to know me.

At the end of the day, I learned a lot about text parsing and analysis in the course of predicting how I would respond to certain emails. I open-sourced the Jupyter Notebook on my GitHub if you’d like to make your own, or step through it and leave feedback. Hopefully this gives you a mad decent insight into the world of natural language processing.
