Ad-hoc querying on AWS: Connecting BI tools to Athena

In a previous post, we discussed using Lambda, Glue and Athena to set up queries of events that are logged by our real-time bidding system. Here, we will build on that foundation, and show how to make this even friendlier to business users by connecting BI tools to this setup.

Luckily, Athena supports both JDBC and ODBC, and, thus, any BI tool that uses either of these connection methods can use Athena!

First, we need to create an IAM user. The minimum policies required are:

  • AmazonAthenaFullAccess
  • Writing to a bucket for Athena query output (use an existing one or create a new one). For the sake of example, let’s call it s3://athena.out
  • Reading from our s3://logbucket bucket, which is where the logs are stored

Now we’ll need the access key and secret key for that user in order to use it with the various tools.
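
If you prefer to script this setup, here is a minimal boto3 sketch of the steps above; the user name and the inline policy name are placeholders, and the S3 permissions may need tightening for your environment.

import json
import boto3

iam = boto3.client('iam')

USER = 'athena-bi'            # placeholder user name
OUTPUT_BUCKET = 'athena.out'  # Athena query output bucket from above
LOG_BUCKET = 'logbucket'      # bucket where the logs live

iam.create_user(UserName=USER)

# Managed policy for Athena itself
iam.attach_user_policy(
    UserName=USER,
    PolicyArn='arn:aws:iam::aws:policy/AmazonAthenaFullAccess')

# Inline policy: write query output to athena.out, read logs from logbucket
s3_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {'Effect': 'Allow',
         'Action': ['s3:GetBucketLocation', 's3:ListBucket'],
         'Resource': [f'arn:aws:s3:::{OUTPUT_BUCKET}',
                      f'arn:aws:s3:::{LOG_BUCKET}']},
        {'Effect': 'Allow',
         'Action': ['s3:PutObject', 's3:GetObject'],
         'Resource': [f'arn:aws:s3:::{OUTPUT_BUCKET}/*']},
        {'Effect': 'Allow',
         'Action': ['s3:GetObject'],
         'Resource': [f'arn:aws:s3:::{LOG_BUCKET}/*']},
    ],
}
iam.put_user_policy(UserName=USER, PolicyName='athena-s3-access',
                    PolicyDocument=json.dumps(s3_policy))

# The access/secret key pair to plug into the BI tools below
key = iam.create_access_key(UserName=USER)['AccessKey']
print(key['AccessKeyId'], key['SecretAccessKey'])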

JDBC

The JDBC driver (com.simba.athena.jdbc.Driver) can be downloaded from the AWS Athena documentation.

The JDBC URL is constructed as follows:

jdbc:awsathena://AwsRegion=<aws-region>;User=<aws-access-key>;Password=<aws-secret-key>;S3OutputLocation=s3://athena.out;

Here’s a sample Java program that shows it in action:

import java.sql.Connection;
import java.sql.DriverManager;
 
public class Main {
  public static void main(String[] args) throws Throwable {
    Class.forName("com.simba.athena.jdbc.Driver");
    String accessKey = "...";
    String secretKey = "...";
    String bucket = "athena.out";
    String url = "jdbc:awsathena://AwsRegion=us-east-1;User=" + accessKey + ";Password=" + secretKey
        + ";S3OutputLocation=s3://" + bucket +”;";
    Connection connection = DriverManager.getConnection(url);
    System.out.println("Successfully connected to\n\t" + url);
  }
}

Example using JDBC: DbVisualizer

  1. If you haven’t already, download the JDBC driver to some folder.
  2. Open the Driver Manager (Tools > Driver Manager).
  3. Press the green + to create a new driver.
  4. Press the folder icon on the right, browse to the folder where you saved the JDBC driver, and select it.
  5. Leave the URL Format field blank and pick com.simba.athena.jdbc.Driver for Driver Class.
  6. Close the Driver Manager, and let’s create a Connection.
  7. We’ll use the “No Wizard” option. Pick Athena from the dropdown in the Driver (JDBC) field and enter the JDBC URL from above in the Database URL field.
  8. Press “Connect” and observe DbVisualizer read the metadata information from Athena (well, Glue, really), including tables and views.

ODBC (on OSX)

  1. Download and run the ODBC driver installer.
  2. Create or edit /Library/ODBC/odbcinst.ini to add the following information:
    [ODBC Drivers]
    Simba Athena ODBC Driver=Installed
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib

    If the odbcinst.ini file already has entries, put new entries into the appropriate sections; e.g., if it was
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so

    Then it becomes
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    Simba Athena ODBC Driver=Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
  3. Create or edit, in a similar fashion, /Library/ODBC/odbc.ini to include the following information:
    [AthenaDSN]
    Driver=/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
    AwsRegion=us-east-1
    S3OutputLocation=s3://athena.out
    AuthenticationType=IAM Credentials
    UID=AWS_ACCESS_KEY
    PWD=AWS_SECRET_KEY
  4. If you wish to test, download and run ODBC Manager. You should see that it successfully recognizes the DSN.
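
Another quick way to test the DSN is from Python with the pyodbc package; a minimal sketch, assuming pyodbc is installed and using the wins table from the previous post:

import pyodbc

# Connect through the DSN configured in /Library/ODBC/odbc.ini;
# the credentials can also live in the DSN instead of the connection string.
conn = pyodbc.connect('DSN=AthenaDSN;UID=AWS_ACCESS_KEY;PWD=AWS_SECRET_KEY',
                      autocommit=True)
cursor = conn.cursor()

# wins is one of the log tables discovered by the Glue crawler
cursor.execute('SELECT COUNT(*) FROM wins')
print(cursor.fetchone()[0])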

Example using ODBC: Excel

  1. Switch to the Data tab, and under New Database Query select From Database.
  2. In the iODBC Data Source Chooser window, select the AthenaDSN we configured above and hit OK.
  3. Annoyingly, despite having configured it, you will be asked for credentials again. Enter the access and secret key.
  4. You should see a Microsoft Query window.

Success!


Ad-hoc querying on AWS: Lambda, Glue, Athena

Introduction

If you give different engineers the same problem, they will usually produce reasonably similar solutions (mutatis mutandis). For example, when I first came across a reference implementation of an RTB platform using AWS, I was amused by how close it was to what we had implemented in one of my previous projects (OpenRTB).

So it was not much of a surprise that the next RTB system used a similar pattern: logs are written to files, pushed to S3, then aggregated in Hadoop, from where the reports are run.

But there were a few problems in the way… 

Log partitioning

The current log partitioning in S3 is by server ID. This is really useful for debugging, and is fine for some aggregations, but it is not great for analytics – it is hard to narrow things down by date, which results in large scans, and that makes joins even harder. Large scans in tools like Athena also translate into larger bills. In short, Hive-like partitioning would be better.

To that end, I’ve created a Lambda function, repartition, which is triggered when a new log file is uploaded to the s3://logbucket bucket:

import boto3
from gzip import GzipFile
from io import BytesIO
import json
import urllib.parse

s3 = boto3.client('s3')

SUFFIX = '.txt.gz'

V = "v8"

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        s3obj = s3.get_object(Bucket=bucket, Key=key)
        src = { 'Bucket': bucket, 'Key': key }
        (node, orig_name) = key.split("/")
        (_, _, node_id) = node.split("_")
        name = orig_name.replace(SUFFIX, "")
        (evt, dt0, hhmmss) = name.split("-")
        hr = hhmmss[0:2]
        # date-hour
        dthr = f"year=20{dt0[0:2]}/month={dt0[2:4]}/day={dt0[4:6]}/hour={hr}"
        
        schema = f"{V}/{evt}/{dthr}"
        dest = f"{schema}/{name}-{node_id}{SUFFIX}"
        print(f"Copying {key} to {dest}")
        s3.copy_object(Bucket=bucket, Key=dest, CopySource=src)

        return "OK"
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

if __name__ == "__main__":
    # Wrapper to run from CLI for now
    s3entry = {'bucket' : {'name' : 'logbucket'},
               'object' : {'key'  : 'server/requests-200420-003740.txt.gz'}}
    event = {'Records' : [{'s3' : s3entry}] }
    lambda_handler(event, None)
    

The log is then copied to a new path created under the v8 prefix, with the following pattern:

<event>/year=<year>/month=<month>/day=<day>/hour=<hour>/<filename>-<nodeID>. For example, 

s3://logbucket/server1234/wins-200421-000003.txt.gz

is copied to 

s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz

(The “v8” prefix is there because I have arrived at this schema having tried several versions — 7, to be exact).

What about storage cost?

  • An additional benefit of using date-based partitioning is that we can easily automate changing the storage class to Glacier for folders older than some specified time, which will save S3 storage costs for the duplicated data (a boto3 sketch follows this list).
  • In the cloud, storage costs are not the thing to worry about; outgoing traffic and compute are where the real costs are. So move the compute to the data, not the other way around.
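
Here is a minimal boto3 sketch of such a lifecycle rule; the rule ID and the 90-day threshold are arbitrary choices for illustration.

import boto3

s3 = boto3.client('s3')

# Move the repartitioned copies under the v8/ prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket='logbucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-v8-logs',
            'Filter': {'Prefix': 'v8/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
        }]
    })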

NOTE: Partitions do not equal timestamps

Partitioning is based on the file name. Records inside a file may have timestamps whose hour is one greater or one less than the partition hour (events logged near the top of the hour can land in the previous or the next file). Partitions are there to reduce the number of scanned records, so take care when querying not to assume that the timestamps under year=2020/month=04/day=21/hour=00 all fall in the 0th hour of 2020-04-21.
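
For example, here is a small sketch of building a partition predicate that covers a target hour plus its neighbours; the record-level filtering would still be done on the log's own timestamp column, whose name varies by table.

from datetime import datetime, timedelta

def partition_filter(target_hour: datetime) -> str:
    """Build an Athena WHERE fragment that scans the target hour plus the
    two neighbouring partitions, since partition values come from file names
    and can be off by an hour relative to the record timestamps."""
    hours = [target_hour + timedelta(hours=d) for d in (-1, 0, 1)]
    clauses = [
        f"(year = '{h:%Y}' AND month = '{h:%m}' AND day = '{h:%d}' AND hour = '{h:%H}')"
        for h in hours
    ]
    return '(' + ' OR '.join(clauses) + ')'

print(partition_filter(datetime(2020, 4, 21, 0)))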

Discover metadata in Glue

Glue is an ETL service on AWS. One of its great features is crawlers, which attempt to glean metadata from the logs. This is really convenient because it saves us the tedious step of defining the metadata manually.

So we set up a crawler. For an explanation of how to do it, see the “Tutorial” link on the left-hand side of the AWS Glue console page.

However, it takes some time to get the configuration correct. 

  1. We want to exclude some logs because we know their format is not good for discovery (they are not straight-up JSON or CSV, etc., and, at the moment, custom SerDes are not supported by Athena) — but see below for exceptions. This is done in the “Data store” part of the crawler configuration.
  2. We want Glue to treat new files of the same type as different partitions of the same table, not to create new tables. For example, given the partitioning convention we created above, these two paths:
  • s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz
  • s3://logbucket/v8/wins/year=2020/month=04/day=22/hour=11/wins-200422-000004-4321.txt.gz

    should be treated as partitions of the table “wins”, not as two different tables. We do this in the “Output” section of the crawler configuration.

Once the crawler runs, we will see a list of tables created in Glue.

If we see tables that look like parts of the partitioned path (e.g., year=2020, or ending with _txt_gz), it means the crawler got confused while discovering their structure. We add those to the crawler’s exclusion list and create their metadata manually. Fortunately, there are not that many such logs, and we can exclude them one by one.

Of course, while the crawler can recognize the file structure, it doesn’t know what to name the fields, so we can go and name them manually. While this is a tedious process, we don’t have to do it all at once – just on an as-needed basis.

We will want to keep the crawler running hourly so that new partitions (which get created every hour) are picked up. (This can also be done manually from Athena – or Hive – by issuing the MSCK REPAIR TABLE command.)
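
The crawler definition and its schedule can also be scripted. Below is a hedged boto3 sketch of a roughly equivalent crawler; the crawler name, IAM role, Glue database name and exclusion patterns are placeholders, and the Configuration value corresponds to the console option that groups compatible files into a single table.

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='logs-crawler',                                     # placeholder
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',   # placeholder
    DatabaseName='logs',                                     # placeholder
    Targets={'S3Targets': [{
        'Path': 's3://logbucket/v8/',
        # Logs whose format confuses the crawler (see below) are excluded
        # and defined manually; the pattern here is a placeholder.
        'Exclusions': ['requests/**'],
    }]},
    # Run hourly so the new hour=... partitions get picked up.
    Schedule='cron(15 * * * ? *)',
    # Group compatible files into partitions of one table per event type
    # rather than creating a new table per path.
    Configuration='{"Version": 1.0, '
                  '"Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}',
)

# The crawler can also be kicked off on demand:
glue.start_crawler(Name='logs-crawler')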

First useful Athena query

Looking now at Athena, we see that the metadata from Glue is visible and we can run queries.
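
The console screenshots are not reproduced here, but as an illustration, here is a hedged sketch of submitting such a query programmatically with boto3; the Glue database name and output location are placeholders.

import time
import boto3

athena = boto3.client('athena')

# Count events per day for one month, using only the partition columns.
query = """
SELECT day, COUNT(*) AS cnt
FROM wins
WHERE year = '2020' AND month = '04'
GROUP BY day
ORDER BY day
"""

qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'logs'},
    ResultConfiguration={'OutputLocation': 's3://athena.out/'},
)['QueryExecutionId']

# Poll until the query finishes, then print the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']
    for row in rows:
        print([col.get('VarCharValue') for col in row['Data']])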

Woohoo! It works! I can do nice ad hoc queries. We’re done, right?

Almost. Unfortunately, for historical reasons, our logs are not always formatted ideally for this setup.

We identified two key cases:

  1. Mostly CSV files:
    • There are event prefixes preceding the event ID, even though the event itself is already defined by the log name; for example, bid:<BID_ID>:
      bid:0000-1111-2222-3333
    • A CSV field really contains two values. For example, a log that is comma-separated into two fields, a timestamp and a “message”, where the message includes a “Win: ” prefix before the bid ID and then – with no comma! – “price: ” followed by the price. Like so:
      04/21/2020 00:59:59.722,Win: a750a866-8b1c-49c9-8a30-34071167374e_200421-00__302 price:0.93
      However, what we want to join on is the ID. In these cases, we can use Athena views. For these two logs, respectively, we can use:
      CREATE OR REPLACE VIEW bids2 AS
      SELECT "year", "month", "day", "hour",
      "SUBSTRING"("bid_colon_bid_id", ("length"('bid:') + 1)) "bid_id"
      FROM bids

      and
      CREATE OR REPLACE VIEW wins2 AS
      SELECT "year", "month", "day", "hour",
      "SPLIT_PART"("message", ' ', 3) "bid_id"
      FROM wins

      Now joins can be done on the bid_id column, which makes for a more readable query.
  2. The other case is a log that has the following format: a timestamp, followed by a comma, followed by JSON (an OpenRTB request wrapped in one more layer of our own JSON that augments it with other data). This makes it neither CSV nor JSON, and the Glue crawler gets confused. The solution is to use a RegEx SerDe that splits each line into a timestamp column and a JSON column (a sketch of an equivalent table definition appears after this list).

    And then we can use Athena’s JSON functions to deal with the JSON column, for example, to see the distribution of requests by country:
     SELECT JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country') country,
       COUNT(*) cnt
     FROM requests
     GROUP BY JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country')
     ORDER BY cnt DESC
    Success! We can now use SQL to easily query our event logs.
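
For reference, since the screenshot of the RegEx SerDe setup is not reproduced here, below is a hedged sketch of an equivalent table definition for the timestamp-comma-JSON log from item 2; the timestamp column name, database name and exact regex are assumptions, and the statement can equally be run in the Athena console instead of through boto3.

import boto3

athena = boto3.client('athena')

# One regex group for the timestamp, one for everything after the first
# comma (the JSON payload queried above as the "request" column).
ddl = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS requests (
  ts      string,
  request string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '^([^,]*),(.*)$')
LOCATION 's3://logbucket/v8/requests/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'logs'},   # placeholder database
    ResultConfiguration={'OutputLocation': 's3://athena.out/'},
)
# New partitions still need to be loaded, e.g. with MSCK REPAIR TABLE requests.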


Monkey business


After reading this thread on the Joel on Software forum (http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=75607&ixReplies=51), I thought I’d put in my couple of bucks… This is a RANT!

It seems that the orientation in business is towards “monkey”
programmers — those who do not think, but do as they are
told. This is because management, apparently (and justifiably),
believes that at any given time it is easier to hire a hundred
monkeys (those are trained ones, that do not type randomly,
and so less than a million and less than infinite time will suffice,
but this is not a good analogy anyway), than a Shakespeare – or even Dumas
(with his own monkeys, so that’s another bad analogy, woe is me!)

As a result, there are (the list is by no means exhaustive; Java is
the language unless otherwise specified — I think Java has produced
more monkeys who think they are software engineers than anything
else — at least VB does not lend one an air of superiority):

  • …monkeys who would rather sharpen the
    carpal-syndrome-inducing skills of cutting and pasting the same
    thing over and over again, rather than learn something like sed or Perl or a
    similar tool —
    or, indeed, spend some effort finding out about the existence
    of such tools and their availability on the monkey platform of
    choice (read: Windows) — or even finding out what plugins are
    available for their lovely IDE.

    IBM, for example, provides a framework called
    EAD4J, Enterprise
    Application Development for Java (it is only available with
    purchase of IBM IGS consulting services). It includes components
    similar to Struts, log4j, etc.
    The framework is well designed, but here is a catch — because
    of its design, adding or changing a service requires changes to
    about 8 files. There are abstract processes, process factories,
    interfaces, factories, XML files with queries, files containing constants to
    look up these queries, etc., etc. It would really be nice if there was
    a simple way to manage it, plugging in your logic where
    some IDE plugin or script does the, well, monkey
    job. Otherwise it’s overdesigned.

    Now, there are simple plugins for the current IDE of choice, WSAD, that at
    least allow generating these
    standard files (if not managing them, which is also important —
    change one signature, and you have to change several
    files). These plugins are provided by IGS.
    But nooo, the monkeys here prefer to create all of this by hand. It’s
    a painful sight.

  • …macaques who cannot fathom how one
    could write a client-server application that does not communicate through
    XML requests embedded in HTTP, but – o, horror! – actually has its own
    application layer protocol.

  • …baboons who think that
    patterns
    are not merely possible (albeit very good) approaches to problems
    (and indeed generalizations of good approaches to common
    problems that have arisen), but the only way to solve problems,
    and that they must be copied straight from
    the book, or else it wouldn’t work. They wouldn’t know a pattern they
    haven’t read about if it bit them on that place their head is
    forever hidden in. If GoF didn’t write about it, it ain’t a pattern.

  • Ok, I am tired of enumerating primate species. I’ll
    just give an anecdote.

    I wrote a module used by several teams. Because of the ever-changing
    requirements, some methods and classes became
    useless. I gave a fair warning by email, then I gave a second one by
    marking them deprecated in the code. I noticed that the
    deprecated
    tags were periodically removed. I sent mail about this, and marked them
    deprecated again. And again. And again.

    A monkey who was the team leader of another team came complaining that
    I should remove it, because he cannot perform a build. Everyone else
    can,
    but he can’t, and so I should remove the single tag (that is probably
    more useful to the whole project than anything he’s ever
    produced). He cannot be bothered to find out how to make
    it work? Why can everyone else make it work? Oh, he’s using some Ant
    scripts? What? That’s an excuse? What the hell does that
    have to do with anything? Oh, he didn’t write those
    scripts. Well, write your own, or take them from those
    people for whom they work. Oh, you don’t have time? Well,
    I don’t have time to keep giving you warnings you just
    ignore, you twit.

    Screw you, I finally thought, the warning has been there for some
    time. I’ll just remove this stuff altogether.
    His build promptly crashed. “Not my problem – we talked about this
    over 5 weeks ago!”, I gloated, producing the emails from my
    appropriately named CYA folder.

    As Butch said, “that’s what he gets for fucking up my sport.”

In short, they are not
Joel’s kind of programmers,
to put it mildly. Monkeys see and monkeys do. They do not think. They
have been taught a way to do things, and it is beyond them to figure
out that there could be another way. I honestly do not think they
understand what a boolean is (I submit that in their mind there is an
if statement, and then there’s a boolean type)
when they write:


if (thingie.isOk()) {
    return true;
} else {
    return false;
}

Then someone they blindly trust (it must be an established authority,
like a book/magazine — only that approved by an already established
authority, because monkeys do not further their education on their
own — or a manager, or an instructor at a paid course) tells
them about a ternary operator. Now they write:


return thingie.isOk() ? true : false;

The above two examples are from actual production code.

Further, because monkeys do not think, they often reinvent the
wheel, badly. Which is also ironic, because they have been imbued with
all the right (and wrong) buzzwords, including “reuse”. I hesitate
to hazard a guess as to whether there is some meaning in their heads
they associate with this word, or is it just something they cry out
when playing free associations with their shrinks (“OO –
Encapsulation! Polymorphism! Reuse!”).

Here are some more anecdotes.

  • One programmer on a project wrote his own utilities to convert things
    from/to hex numbers, for crying out loud. Here is Java, the only thing
    he knows at all, and he can’t be bothered to think that maybe,
    just maybe, such a thing is a part of standard API.

  • This same monkey took several weeks to write a
    parser (for a very simple
    grammar, containing only certain expressions and operators such as
    ANDs and ORs). When I asked him why he didn’t use a
    parser generator (such as ANTLR, CUPS or JavaCC), he replied that
    he didn’t know any of them. Now, it is not a crime not to know a
    particular technology, but surely a programmer must be a) aware
    that there are such things as parser generators, and b)
    able to learn how to use one. Whether he lacked the understanding or the
    desire to learn, is this the kind of developer you want?

  • Background: We needed to create some scripts doing export from the
    database. The export was to be done under some specific
    conditions, which were to be specified in the queries
    (that is, only export dependent tables if their parent
    tables are eligible to be exported, etc.) The logic was
    only in SQL queries, the rest were just scripts passing
    these queries to the DB2 command line and logging everything.
    All of those were written by hand, with 80% of the time spent copying
    and pasting things, and then looking for places where the
    pasted things needed to be changed a bit (for example,
    some things are exported several times into different IXF
    files, because they are dependent on different
    things. These files need to be numbered sequentially, so
    the next one does not overwrite the other. What do monkeys do?
    Number them by hand. Great.)

    When I suggested automating things, in fact, automating
    from the first step – even before writing our own queries,
    using the metadata to generate the
    queries themselves, I was looked at as if I just escaped
    from the mental asylum.

    Monkey But you cannot just rely on metadata, there are also
    functional links which are not foreign keys.

    Me Why are they not foreign keys in the first place?

    Monkey Because they are functional.

    Me Stop using that word. Tell me why are they not foreign keys?

    Monkey Because they are nullable.

    Me A foreign key can be nullable! Why is it not a foreign key?
    OK, whatever, that’s our DBA’s problem… But there’s a convention for
    functional keys anyway (we know they all start with
    SFK_, by convention). I’ll use that.

    Two days pass. My script works. A week later, they have problems
    with their original scripts. My approach works,
    demonstrably. But ok, they want to keep doing it their
    way, fine. They ask for help with their way – those scripts, wrapping
    hand-made SQL queries (which are already being automatically
    generated, but I’ll hold off on that for now…)

    Monkey What are you doing?

    Me Writing a Perl script.

    Monkey But there is no Perl on Windows.

    Me See, I am sitting at a Windows machine and I have Perl.

    Monkey What is it for? I thought Perl was only for the Web?

    Me I am writing a script to generate your silly scripts
    from the small set of user input. The resulting files, which you
    are now doing BY HAND, are cluttered with repetitive stuff, such
    as error-handling code and file numbering, and it’s error-prone to do
    search and
    replace manually. So we’ll generate all these scripts using my script.

    Monkey But they don’t have Perl on their Windows.

    Me Who are “they”?

    Monkey The client?

    Me First of all, this is for the AIX machine. Second, this is
    not for them, we will just deliver the generated shell scripts, the
    Perl script is for us only.

    It takes several iterations for Monkey to get it.

    A day passes…

    Me Hey, where’s my Perl script I wrote to generate the import
    scripts?

    Monkey We have to have only shell scripts.

    Me Yes, I used that one to create those shell scripts, dammit!

    Monkey (sits writing these shell scripts again by hand. At the
    moment, manually changing some upper-case strings to lower-case)
    I
    removed it from CVS. They only want shell scripts on their machine.

    Me It wasn’t going on their machine! It’s only for us!!!

    Monkey Here, I changed these files already, you change the
    rest.

    Me (giving up) OK.

    Monkey Oh, and they have to be K-shell. Change them all to
    .ksh.

    Me Why do they have to be ksh? What’s wrong with sh? They are
    all very simple anyway, just call db2 import, check error status,
    that’s it.

    Monkey They have to be K shell. That’s what the DBA said.

    Me What the hell does the DBA have to do with it?

    Monkey He wants to be able to change them, and he doesn’t know sh,
    only ksh.

    Me Ok, fine. I suppose you’re right, echo is
    different in K-shell.

    Monkey misses the seething sarcasm. Of course.

    Monkey Right here.

    Me I don’t see them. What is this OAD_0035.ksh? Is that it?

    Monkey Yes.

    Me What does this mean? What do these numbers mean?

    Monkey That’s what they said they should be called.

    Me Who are “they”???

    Silence.

    Me OK, you have a script called OAD_0035.ksh calling
    OAD_0038.ksh, which in turn calls OAD_0038_1.ksh, OAD_0039_2.ksh, etc.
    Why are they called this? It’s hard to remember which one is which.

    Monkey Why do you want to know what it means?

    Me Because if I don’t know what it means, it’s much harder for
    me to look at the file and see what is supposed to be inside. Ah, I
    see
    you added the insightful comment inside each file with its meaningful
    name. Ah, I see also, you use that stupid name inside of it over and
    over again, to write to the logs, instead of just using $0. (deep
    breath). I’ll just create some symbolic links to them with meaningful
    names, so I know what’s going on…

    An hour later

    Me Where are my links?

    Monkey They only wanted files there that are named like
    OAD_0035, etc.

    Me What the hell do these numbers mean???

    Monkey I don’t know. For security.

    Me (pause) Who told you to do this?

    Monkey The client.

    Me The client is a company. Who have you met from the company?

    Monkey I don’t know. They said the client wants this.

    Me Who said? Where? When?

    Now I’m really curious. I turn with this question to
    others. Finally I come to the last monkey who knows.

    Monkey 5 Uh, the client has guidelines on what they are
    supposed to be called. It should start with OAD, and
    then underscore, then four characters.

    Me Why four characters?

    Monkey 5 For normalization.

    Me What normalization?! What can you possibly
    change in a
    luminous egg,
    I mean, normalize in a 20-line shell script?

    Monkey 5 So they can keep them consistent and do some things to
    all of them, regardless of what they are called.

    Me What can they possibly want to do with shell scripts? Rename
    them to some other numeric pattern? There isn’t even any method as to
    how they are named, it’s not like a certain number
    pattern means it’s dependent on the other. You just
    named them randomly…

    Curtain

    But hey, fire one, and the replacement is easy to find. That’s true.
    I suppose Henry Ford would be proud, but isn’t this a backward
    approach? You don’t need monkeys at all, most of this work can
    be automated.

    Maybe I need another line of work. 🙂