In my earlier post on Elasticsearch and Python, we did a huge pile of work: we learned a bit about how to use Elasticsearch, we learned how to use Gmvault to back up all of our Gmail messages with full metadata, we learned how to index the metadata, and we learned how to query the data naïvely. While that’s all well and good, what we really want to do is to index the whole text of each email. That’s what we’re going to do today.

It turns out that nearly all of the steps involved in doing this don’t involve Elasticsearch; they involve parsing emails. So let’s take a quick time-out to talk a little bit about emails.

A Little Bit About Emails

It’s easy to think of emails as simple text documents. And they kind of are, to a point. But there’s a lot of nuance to the exact format, and while Python has libraries that will help us deal with them, we’re going to need to be aware of what’s going on to get useful data out.

To start, let’s take another, more complete look at the raw email source we saw last time:

$ gzcat /Users/bp/src/vaults/personal/db/2005-09/11814848380322622.eml.gz
X-Gmail-Received: 887d27e7d009160966b15a5d86b579679
Delivered-To: benjamin.pollack@gmail.com
Received: by 10.36.96.7 with SMTP id t7cs86717nzb;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Received: by 10.70.13.4 with SMTP id 4mr150611wxm;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Return-Path: <probablyadefunctaddressbutjustincase@duke.edu>
[...more of the same...]
Message-ID: <4328DDFA.4050903@duke.edu>
Date: Wed, 14 Sep 2005 22:35:38 -0400
From: No Longer a Student <probablyadefunctaddressbutjustincase@duke.edu>
Reply-To: probablyadefunctaddressbutjustincase@duke.edu
User-Agent: Mozilla Thunderbird 1.0.6 (Macintosh/20050716)
MIME-Version: 1.0
To: benjamin.pollack@gmail.com
Subject: Celebrating 21 years
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

    It's my birthday, Blue Devils!  At least it will be in a few days, so I
am opening my apartment, porch, and environs this Friday, Sept. 16th
to all of you for some celebration.  Come dressed up, come drunk, or
whatever...just come.  There will be plenty to drink, and for those of
you that are a wine connessiouers, Cheerwine is the closest you'll get.
Kickoff is at 10:30pm.  Pass-out is at <early Saturday morning>.  If you
have some drink preferences, let me know and we'll see what we can
snag.  In addition to that, let me know if you think you can make it,
even if only for a while, so we can judge the amount of booze that we'll
be stocking.

All you really need to know:
Friday, Sept. 16th
10:30pm-late
alcohol

This is about the simplest form of an email you can have. At the top, we have a bunch of metadata about the email itself. Notably, while these look kind of like key/value pairs, we can see that at least some keys are allowed to repeat. That said, we’d like to try to merge this with the existing metadata we’ve got if we can.

There’s also the, you know, actual content of the email. In this particular case, that’s clearly just a blob of plain text, but let’s be honest: we know from experience that some emails have a lot of other things—attachments, HTML, and so on.[1] Emails that have formatting or attachments are called multipart messages: each chunk corresponds to a different piece of the email, like an attachment, or a formatted version, or an encryption signature. For a toy tool, we don’t really need to do anything special with all of the attachments and whatnot; we just want to grab as much as we can from the email itself. Since, in real life, nearly all multipart emails include a plain text part, it’ll be good enough if we can just grab that.
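
To make “multipart” concrete: the raw body gets chopped into chunks by a boundary string that’s declared in the headers. A formatted email looks roughly like this (boundary invented, most headers omitted):

Content-Type: multipart/alternative; boundary="XXXXboundary"

--XXXXboundary
Content-Type: text/plain; charset=UTF-8

Hello in boring plain text.
--XXXXboundary
Content-Type: text/html; charset=UTF-8

<p>Hello in <b>exciting</b> HTML.</p>
--XXXXboundary--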

Let’s make that the goal: we do care about the header values, and we’ll extract any plain text in the email, but the rest can wait for another day.

Parsing Emails in Python

So we know what we want to do. How do we do it in Python?

Well, we’ll need two things: we’ll need to decompress the .eml.gz files, and we’ll need to parse the emails. Thankfully, both pieces are pretty easy.

Python has a gzip module that trivially handles reading compressed data. Basically, wherever you’d otherwise write open(path_name, mode), you instead write gzip.open(path_name, mode). That’s really all there is to that part.

For parsing the emails, Python provides a built-in library, email, which does this tolerably well. For one thing, it allows us to easily grab out all of those header values without figuring out how to parse them. (We’ll see shortly that it also provides us a good way to get the raw text part of an email, but let’s hold that thought for a moment.)

There’s unfortunately one extra little thing: emails are not in a reliable encoding. Sure, they might claim they’re in something consistent, like UTF-7, but you know they aren’t. This is a bit of a problem, because Elasticsearch is going to want to be handed nothing but pure, clean Unicode text.

For the purposes of a toy, it’ll be enough if we just make a good-faith effort to grab useful information out, even if it’s lossy. Since most emails are sent in Latin-1 or ASCII encoding, we can be really, really lazy about this by introducing a utility function that decodes strings as Latin-1 and replaces anything it doesn’t recognize with the Unicode replacement character, �. (Pedantically, Latin-1 assigns a character to every possible byte, so the replace branch can never actually fire; it’s just a safety net in case we ever swap in a stricter codec.)

def unicodish(s):
    return s.decode('latin-1', errors='replace')
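
A quick REPL sanity check of what this does (the byte string is just an example; '\xf6' is Latin-1 for 'ö'):

>>> unicodish('Mot\xf6rhead')
u'Mot\xf6rhead'
>>> print unicodish('Mot\xf6rhead')
Motörhead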

With that in mind, we can start playing with these modules immediately. In your Python REPL, try something like this:

import email
import gzip

with gzip.open('/path/to/an/email.eml.gz', 'r') as fp:
    message = email.message_from_file(fp)
print '%r' % (message.items(),)

This looks awesome. The call to email.message_from_file() gives us back a Message object, and all we have to do to get all the header values is to call message.items().
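
One wrinkle worth knowing about: items() hands back repeated headers (like Received) as separate pairs, while dict-style access only gives you a single value. If you ever want every value for one header, Message.get_all() does it:

received = message.get_all('Received')  # a list of every Received header, in order
one = message['received']               # case-insensitive, but returns just one value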

All that’s left for this part is to merge the email headers with the Gmail metadata, so let’s do that now. While header keys can repeat, we don’t actually care: the fields we care about, like From and To, don’t repeat, and if we accidentally end up with only one Received field when we should have fifteen, we don’t care. This is, after all, something we’re hacking together for fun, and I’ve never in my life cared to query the Received field. This gives us an idea for a way to quickly handle things: we can just combine the existing headers with our current metadata.

So, ultimately, we’re really just changing our original metadata loading code from

with open(path.join(base, name), 'r') as fp:
    meta = json.load(fp)

to

with gzip.open(path.join(base, name.rsplit('.', 1)[0] + '.eml.gz'), 'r') as fp:
    message = email.message_from_file(fp)
meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
with open(path.join(base, name), 'r') as fp:
    meta.update(json.load(fp))

Not bad for a huge feature upgrade.

Note that this prioritizes Gmail metadata over email headers, which we want: if some email has an extra, non-standard Label header, we don’t want it to trample our Gmail labels. We’re also normalizing the header keys, making them all lowercase, so we don’t have to deal with email clients that secretly write from and to instead of From and To.
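
If the precedence isn’t obvious, here’s a toy illustration (values invented) of why update() puts Gmail in charge:

meta = {u'labels': u'sneaky value from a nonstandard header'}
meta.update({u'labels': [u'\\Inbox', u'friends']})  # the Gmail metadata
# meta[u'labels'] is now [u'\\Inbox', u'friends']: last writer wins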

That’s it for headers. Give it a shot: try running your modified loader script, then querying with the --raw-result flag we added to our query tool last time. We’re not printing the new data in a user-friendly way yet, but it’s already searchable.
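
For example, assuming you saved the query tool as query_mail.py (use whatever name you actually picked):

$ python query_mail.py 'duke.edu' --raw-result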

In fact, you know what? Sure, this is a toy, but honestly, it’s just not hard to make this print at least a little more useful data. Just having From and To would be helpful, so let’s quickly tweak the tool by altering the final click.echo() call:

#!/usr/bin/env python

import json

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', q=query)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            # This next line and the two after it are the only changes
            click.echo('To:{}\nFrom:{}\nSubject:{}\nPath: {}\n\n'.format(
                hit['_source']['to'], hit['_source']['from'],
                hit['_source']['subject'], hit['_source']['path']))

if __name__ == '__main__':
    search()

Bingo, done. Not bad for a three-line edit.
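
For the sample email from earlier, a match prints roughly like this (your path will differ):

To:benjamin.pollack@gmail.com
From:No Longer a Student <probablyadefunctaddressbutjustincase@duke.edu>
Subject:Celebrating 21 years
Path: /Users/bp/src/vaults/personal/db/2005-09/11814848380322622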

For the body itself, we need to do something a little bit more complicated. As we discussed earlier, emails can be simple or multipart, and Python’s email module unfortunately exposes that difference to the user. For simple emails, we’ll just grab the body, which will likely be plain text. For multipart, we’ll grab any parts that are plain text, smash them all together, and use that for the body of the email.

So let’s give it a shot. I’m going to pull out the io module so we can access StringIO for efficient string building, but you could also just do straight-up string concatenation here and get something that would perform just fine. Our body reader then is going to look something like this:

content = io.StringIO()
if message.is_multipart():
    for part in message.get_payload():
        if part.get_content_type() == 'text/plain':
            content.write(unicodish(part.get_payload()))
else:
    content.write(unicodish(message.get_payload()))

This code simply looks for anything labeled plain text and builds a giant blob of it, handling the plain case and the multipart case differently.[2]
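
One aside: parts of a multipart message can themselves be multipart, and the code above only looks one level deep. If you ever care about nested parts, Message.walk() iterates the whole tree (including the message itself), so a sketch of a more thorough version would be:

content = io.StringIO()
# walk() yields the message plus every nested part, depth-first, so this
# one loop handles simple, multipart, and nested-multipart messages alike
for part in message.walk():
    if part.get_content_type() == 'text/plain':
        content.write(unicodish(part.get_payload()))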

Well, if you think about it, we’ve done all the actual parsing we need to do. That just leaves Elasticsearch integration. We want to combine this with the metadata parsing we already had, so our final code for indexing will look like:

def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]
    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()
    es.index(index='mail', doc_type='message', id=gm_id, body=body)

That’s it. On my system, this can index every last one of the tens of thousands of emails I’ve got in only a minute or so, and the old query tool we wrote can easily search through all of them in tens of milliseconds.

Making a Real Script

Last time, we used click to make our little one-off query tool have a nice UI. Let’s do that for the data loader, too. All we really need to do is make that ad-hoc parse_and_store function be a real main function. The result will look like this:

#!/usr/bin/env python

import email
import json
import gzip
import io
import os
from os import path

import click
import elasticsearch


def unicodish(s):
    return s.decode('latin-1', errors='replace')


def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]

    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()

    es.index(index='mail', doc_type='message', id=gm_id, body=body)


@click.command()
@click.argument('root', required=True, type=click.Path(exists=True))
def index(root):
    """imports all gmvault emails at ROOT into INDEX"""
    es = elasticsearch.Elasticsearch()
    root = path.abspath(root)
    for base, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.meta'):
                parse_and_store(es, root, path.join(base, name.split('.')[0]))

if __name__ == '__main__':
    index()
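
Running it is just a matter of pointing it at your gmvault vault; assuming you saved the script as load_mail.py (again, name it whatever you like):

$ python load_mail.py ~/src/vaults/personal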

Until Next Time

For now, you can see that what we’ve got works by using the old query tool with the --raw-result flag, and you can use it to run queries across all of your stored email. But the query tool is lacking in multiple ways: it doesn’t output everything we care about (specifically, a useful bit of the message bodies), and it doesn’t treat some fields (like labels) as exact matches the way we’d want. We’ll fix both of those next time, but for now, we can rest knowing that we’re successfully storing everything we care about. Everything else is going to be UI.


  1. After all, if you can’t attach a Word document containing your cover letter to a blank email saying “Job Application”, what’s the point of email? ↩︎

  2. I actually think the Python library messes this up: simple emails and multipart emails really ought to look the same to the developer, but unfortunately, that’s the way the cookie crumbled. ↩︎