I find it all too easy to forget how fun programming used to be when I was first starting out. It’s not that a lot of my day-to-day isn’t fun and rewarding; if it weren’t, I’d do something else. But it’s a different kind of rewarding: the rewarding feeling you get when you patch a leaky roof or silence a squeaky axle. It’s all too easy to get into a groove where you’re dealing with yet another encoding bug that you can fix with that same library you used the last ten times. Yet another pile of multithreading issues that you can fix by rewriting the code into shared-nothing CSP-style. Yet another performance issue you can fix by ripping out code that was too clever by half with Guava.

As an experienced developer, it’s great to have such a rich toolbox available to deal with issues, and I certainly feel like I’ve had a great day when I’ve fixed a pile of issues and shipped them to production. But it just doesn’t feel the same as when I was just starting out. I don’t get the same kind of daily brain-hurt as I did when everything was new,[1] and, sometimes, when I just want to do something “for fun”, all those best practices designed to keep you from shooting yourself (or anyone else) in the foot just get in the way.

Over the past several months, Julia Evans has been publishing a series of blog posts about just having fun with programming. Sometimes these are “easy” topics, but sometimes they’re quite deep (e.g., You Can Be a Kernel Hacker). Using her work as inspiration, I’m going to do a series of blog posts over the next couple of months that just have fun with programming. They won’t demonstrate best practices, except incidentally. They won’t always use the best tools for the job. They won’t always be pretty. But they’ll be fun, and show how much you can get done with quick hacks when you really want to.

So, what’ll we do as our first project? Well, for a while, I’ve wanted super-fast offline search through my Gmail messages for when I’m traveling. The quickest solution I know for getting incredibly fast full-text search is to whip out Elasticsearch, a really excellent full-text search engine that I used to great effect on Kiln at Fog Creek.[2]

We’ll also need a programming language. For this part of the series, I’ll choose Python, because it strikes a decent balance between being flexible and being sane.[3]

I figure we can probably put together most of this in a couple of hours spread over the course of a week. So for our first day, let’s gently put best practices on the curb, and see if we can’t at least get storage and maybe some querying done.

Enough Elasticsearch to Make Bad Decisions

I don’t want to spend this post on Elasticsearch; that’s really well handled elsewhere. What you should do is read the first chapter or two of Elasticsearch: the Definitive Guide. And if you actually do that, skip ahead to the Python bit. But if you’re not going to do that, here’s all you need to know about Elasticsearch to follow along.

Elasticsearch is a full-text search database, powered by Lucene. You feed it JSON documents, and then you can ask Elasticsearch to find those documents based on the full-text data within them. A given Elasticsearch instance can have lots of indexes, which is what every other database on earth calls a database, and each index can have different document types, which every other database on earth calls a table. And that’s about it.

“Indexing” (storing) a document is really simple. In fact, it’s so simple, let’s just do it.

First, if you haven’t already, install the Python library for Elasticsearch with a simple pip install elasticsearch, and then launch a Python shell. I like bpython for this purpose, since it’s very lightweight and provides great tab completion and as-you-type help, but you could also use IPython or something else. Next, if you haven’t already, grab a copy of Elasticsearch and fire it up. This involves the very complicated steps of

  1. Downloading Elasticsearch;
  2. Extracting it; and
  3. Launching it by firing up the bin/elasticsearch script in a terminal.

That’s it. You can make sure it’s running by hitting http://localhost:9200/ in a web browser. If things are looking good, you should get back something like

{
  "status" : 200,
  "name" : "Gigantus",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Then, assuming you are just running a vanilla Elasticsearch instance, give this a try in your Python shell:

import elasticsearch
es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200
es.index(index='posts', doc_type='blog', id=1, body={
    'author': 'Santa Clause',
    'blog': 'Slave Based Shippers of the North',
    'title': 'Using Celery for distributing gift dispatch',
    'topics': ['slave labor', 'elves', 'python',
               'celery', 'antigravity reindeer'],
    'awesomeness': 0.2
})

That’s it. You didn’t have to create a posts index; Elasticsearch made it when you tried storing the first document there. You likewise didn’t have to specify the document schema; Elasticsearch just inferred it based on the first document you provided.[4]
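
If you’re curious what Elasticsearch actually guessed, you can ask the indices API for the mapping it inferred; this is just a peek under the hood, nothing the rest of this post depends on:

import json

# Show the field types Elasticsearch inferred for the 'posts' index.
print(json.dumps(es.indices.get_mapping(index='posts', doc_type='blog'), indent=2))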

Want to store more documents? Just repeat the process:

es.index(index='posts', doc_type='blog', id=2, body={
    'author': 'Benjamin Pollack',
    'blog': 'bitquabit',
    'title': 'Having Fun: Python and Elasticsearch',
    'topics': ['elasticsearch', 'python', 'parseltongue'],
    'awesomeness': 0.7
})
es.index(index='posts', doc_type='blog', id=3, body={
    'author': 'Benjamin Pollack',
    'blog': 'bitquabit',
    'title': 'How to Write Clickbait Titles About Git Being Awful Compared to Mercurial',
    'topics': ['mercurial', 'git', 'flamewars', 'hidden messages'],
    'awesomeness': 0.95
})

Getting documents is just as easy. E.g., how can we see one of the posts we just indexed?

es.get(index='posts', doc_type='blog', id=2)
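
What comes back isn’t the bare document; it’s a small wrapper, with the original body tucked under the _source key:

doc = es.get(index='posts', doc_type='blog', id=2)
print(doc['_source']['title'])  # exactly what we indexed a moment ago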

Of course, this is boring; we’re using Elasticsearch as a really bizarre key/value store, but the whole point of Elasticsearch is to allow you to, well, search. So let’s do that.

Elasticsearch provides two different ways to search documents. There’s a structured query language, which allows you to very carefully and unambiguously specify complex queries; and there’s a simple, Lucene-based syntax that is great for hacking things together. For the moment, let’s just play with the Lucene-based one. What’s that look like? Well, if you wanted to find all posts where I was the author, you could simply do

es.search(index='posts', q='author:"Benjamin Pollack"')

So all we have to do is write field:value and we get search. You could also just do something like

es.search(index='posts', q='Santa')

to search across all fields, or mix and match:

es.search(index='posts', q='author:"Benjamin Pollack" python')

It’s just that simple.[5]
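
Each of those search calls hands back a plain dictionary: the matches live under hits, and each hit carries its _id, a relevance _score, and the stored document under _source. A quick peek, reusing the client from above:

results = es.search(index='posts', q='python')
for hit in results['hits']['hits']:
    # every hit includes the stored document plus Lucene's relevance score
    print(hit['_id'], hit['_score'], hit['_source']['title'])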

And, hey, this seems really close to Gmail’s search syntax. Maybe that’ll come in handy later.

Getting the Gmail Data

So with Elasticsearch under our belt, let’s look at actually coding up how to get the Gmail data! We’ll need to write an IMAP client, and then de-duplicate messages due to labels, and figure out what the labels are, and…

…or, better yet, let’s not, because someone else already figured that part out for us. Gmvault already allows mirroring your Gmail account, including all the special stuff, like which labels are on which emails and so on. So let’s just install and use that. You can install it with a simple pip install gmvault==1.8.1-beta --allow-external IMAPClient[6], and then you can sync your email with a simple

gmvault sync you@gmail.com -d path/to/where/you/want/the/email/archived

Not only will this sync things for you; it’ll do it with proper OAuth semantics and everything. So that takes care of getting the emails for offline access. Next up, let’s figure out how to start getting data into Elasticsearch.

Loading the Metadata

Once Gmvault finishes downloading your emails, if you go poke around, you’ll see there’s a really simple structure going on in the downloaded data. Assuming you synced to ~/gmails, then you’ll see something like:

~/gmails/db/2005-09/118148483803226229.meta
~/gmails/db/2005-09/118148483803226229.eml.gz
~/gmails/db/2007-03/123168411054578126.meta
~/gmails/db/2007-03/123168411054578126.eml.gz
    ...

This looks really promising. I wonder what format those .metas are?

$ cat 2007-03/123168411054578.meta | python -mjson.tool
{
    "flags": [
        "\\Seen"
    ],
    "gm_id": 123168411054578,
    "internal_date": 1174611101,
    "labels": [
         "Registrations"
    ],
    "msg_id": "19b71702474dd770796e8aa45d@www.rememberthemilk.com",
    "subject": "Welcome to Remember The Milk!",
    "thread_ids": 123168411054578,
    "x_gmail_received": null
}

Perfect! Elasticsearch takes JSON, and these are already JSON, so all we have to do is to submit these to Elasticsearch and we’re good. Further, these have a built-in ID, gm_id, that matches the file name of the actual email on disk, so we’ve got a really simple mapping to make this all work.
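
Since the metadata and the message share a file name and differ only by extension, turning one path into the other is a one-liner. A throwaway helper along these lines (the name is mine, purely for illustration) will do once we get to the email bodies:

def eml_path_for(meta_path):
    # e.g. '2007-03/123168411054578126.meta' -> '2007-03/123168411054578126.eml.gz'
    return meta_path[:-len('.meta')] + '.eml.gz'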

And what are the .eml.gz files?

$ gzcat 2005-09/11814848380322.eml.gz
X-Gmail-Received: 887d27e7d009160966b15a5d86b579679
Delivered-To: benjamin.pollack@gmail.com
Received: by 10.36.96.7 with SMTP id t7cs86717nzb;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Received: by 10.70.13.4 with SMTP id 4mr150611wxm;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Return-Path: <probablyadefunctaddressbutjustincase@duke.edu>

Okay, so good news: that’s the email that goes with the metadata. Bad news: parsing emails in Python sucks. For today, let’s start by indexing just the metadata, and then spin back around to handle the full email text later.

To make this work, we really only need three tools:

  • os.walk, which lets us walk a directory hierarchy;
  • json, the Python module for working with JSON; and
  • elasticsearch, the Python interface for Elasticsearch we already discussed earlier.

We’ll walk all the files in the root of the Gmvault database using os.walk, find all files that end in .meta, load the JSON in those files, tweak the JSON just a bit (more on that in a second), and then shove the JSON into Elasticsearch.

import json
import os
from os import path
import elasticsearch

es = elasticsearch.Elasticsearch()
root = '/home/you/gmails'  # wherever you pointed gmvault sync

for base, subdirs, files in os.walk(root):
    for name in files:
        if name.endswith('.meta'):
            with open(path.join(base, name), 'r') as fp:
                meta = json.load(fp)
            meta['account'] = path.split(root)[-1]
            meta['path'] = path.join(base, name)
            es.index(index='mail', doc_type='message', id=meta['gm_id'], body=meta)

And that’s seriously it. Elasticsearch will automatically create the index and the document type based on the first document we provide it, and will store everything else. On my machine, this chews through tens of thousands of meta files in just a couple of seconds.

I did throw in two extra little bits to help us later: I explicitly track the path to the metadata (that’s the meta['path'] = path.join(base, name) bit), and I explicitly set the account name based on the path so we can load up multiple email accounts if we want (that’s the meta['account'] = path.split(root)[-1] part). Otherwise, I’m just doing a vanilla es.index() call on the raw JSON we loaded.

So far, so good. But did it work?
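
A cheap first check, before writing any real queries, is to count what landed in the index; the number should be close to the number of .meta files on disk:

# Total documents Elasticsearch is now holding in the 'mail' index.
print(es.count(index='mail')['count'])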

Searching the Metadata

We can start by being really lazy. As noted earlier, Elasticsearch provides two search mechanisms: a structured query language and a string-based API that just passes your query onto Lucene. For production apps, you pretty much always want to use the first one; it’s more explicit, more performant, and less ambiguous.
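
For a taste of what the structured route looks like, you hand es.search() a JSON body describing the query instead of a q string; the match-on-subject query below is purely an illustration, not something we’ll need today:

# Structured (query DSL) version of a simple one-field search.
es.search(index='mail', body={
    'query': {
        'match': {'subject': 'milk'}
    }
})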

But this isn’t a production app, so let’s be lazy. As you know if you read the Elasticsearch section first, doing a Lucene-based query is this monstrosity:

es.search('mail', q=query)

Yep. That’s all it takes to do a full-text search using the built-in Lucene-backed query syntax. Try it right now. You’ll see you get back a JSON blob (already decoded into native Python objects) with all of the results that match your query. So all we’d really have to do to exceed our goal for today, to have both storing data and querying, would be to whip up a little command-line wrapper around this simple command call.

When I think of writing command-line tools in Python, I think of click, a super-simple library for whipping up great command-line parsing. Let’s bring in click via pip install click, and use it to write our tool.

All we have to do for this tool is allow passing a Lucene-like string to Elasticsearch. I’ll also add an extra command-line parameter for printing the raw results from Elasticsearch, since that’ll be useful for debugging.

Here’s all it takes for the first draft of our tool:

#!/usr/bin/env python

import json

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', q=query)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            click.echo('Subject:{}\nPath: {}\n\n'.format(
                hit['_source']['subject'],
                hit['_source']['path']
            ))

if __name__ == '__main__':
    search()

Let’s walk through this slowly. The initial @click.command() tells Click that the following function is going to be exposed as a command-line entry point. @click.argument and @click.option allow us to specify mandatory arguments and optional parameters, respectively, just by naming them. Then we can declare our function, make sure that our function arguments match the command-line arguments, and Click takes care of the rest.

The actual body of the function has no real surprises, but just to go over things: the es.search() call we’ve already discussed. We don’t care about any of the result metadata that Elasticsearch provides, so we just grab matches['hits']['hits'] right away;[7] and, by default, Elasticsearch stores the entire document when you index it, so instead of loading the original document explicitly, we can be really lazy and just look at the _source dictionary key in the hit.[8]
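
If you’d rather poke at the command from a Python shell than a terminal, click ships with a little test runner that invokes it the same way the shell would; the query string here is just an example against our metadata fields:

from click.testing import CliRunner

# Run the 'search' command in-process and capture what it would have printed.
result = CliRunner().invoke(search, ['subject:milk', '--raw-result'])
print(result.output)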

Aaaand we’ve already got full-text and field-based searching across message metadata, including a nice command-line client. In less than an hour.

In the next post, we’ll explore parsing emails in Python and doing full-text search through the whole body. For today, we’ve already got lightning-fast search across message metadata.


  1. While it’s true that picking up my fifteenth package management tool certainly makes my brain hurt, that’s not quite what I mean here. ↩︎

  2. Sure, something like Xapian might make more sense, but it’s harder and I don’t know it very well, and this is for fun, so who cares. ↩︎

  3. Besides which, this particular example is heavily adapted from a talk I gave at NYC Python a few weeks back, so Python already had one foot in the door. You should come to NYC Python if you’re in the New York area; it’s a great group. ↩︎

  4. You can, and in production usually need to, specify an explicit schema. Elasticsearch calls these mappings. However, the defaults are usually totally fine for messing around, so I’m not going to explore mappings here at all. ↩︎

  5. For some queries, you may notice that Elasticsearch returns more documents than you think ought to match. If you look, you’ll see that, while it’s returning them, it thinks they’re really lousy results: their _score property will be very low, like 0.004. We’ll explore ways to mitigate this later, but note that this is similar to search engines like DuckDuckGo, Google, and Bing trying very hard to find you websites when your terms just don’t honestly match very much. ↩︎

  6. IMAPClient unfortunately is not hosted on PyPI, so you’ll need to explicitly enable fetching it from non-approved sources to proceed. ↩︎

  7. The upper parts of the dictionary contain really useful metadata about the results, but we don’t care about them for our work here, so we’ll just throw them out. ↩︎

  8. See the previous footnote about mappings? This is part of why production apps almost always need them: mappings allow you to disable storing the full document, since you probably have the original stored elsewhere anyway. But, again, it’s actually really handy for debugging and toy projects. ↩︎