Ad-hoc querying on AWS: Connecting BI tools to Athena

In a previous post, we discussed using Lambda, Glue and Athena to set up queries of events that are logged by our real-time bidding system. Here, we will build on that foundation, and show how to make this even friendlier to business users by connecting BI tools to this setup.

Luckily, Athena supports both JDBC and ODBC, and, thus, any BI tool that uses either of these connection methods can use Athena!

First, we need to create an IAM user. The minimum policies required are:

  • AmazonAthenaFullAccess
  • Writing to a bucket for Athena query output (use an existing one or create a new one). For the sake of example, let’s call it s3://athena.out
  • Reading from our s3://logbucket bucket, which is where the logs are stored

Now we’ll need the access key and secret key for that user in order to use it with the various tools.
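
If you prefer to script this setup, here is a minimal boto3 sketch of the steps above; the user name and the inline policy name are placeholders, and the S3 permissions may need tightening for your environment.

import json
import boto3

iam = boto3.client('iam')

USER = 'athena-bi'            # placeholder user name
OUTPUT_BUCKET = 'athena.out'  # Athena query output bucket from above
LOG_BUCKET = 'logbucket'      # bucket where the logs live

iam.create_user(UserName=USER)

# Managed policy for Athena itself
iam.attach_user_policy(
    UserName=USER,
    PolicyArn='arn:aws:iam::aws:policy/AmazonAthenaFullAccess')

# Inline policy: write query output to athena.out, read logs from logbucket
s3_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {'Effect': 'Allow',
         'Action': ['s3:GetBucketLocation', 's3:ListBucket'],
         'Resource': [f'arn:aws:s3:::{OUTPUT_BUCKET}',
                      f'arn:aws:s3:::{LOG_BUCKET}']},
        {'Effect': 'Allow',
         'Action': ['s3:PutObject', 's3:GetObject'],
         'Resource': [f'arn:aws:s3:::{OUTPUT_BUCKET}/*']},
        {'Effect': 'Allow',
         'Action': ['s3:GetObject'],
         'Resource': [f'arn:aws:s3:::{LOG_BUCKET}/*']},
    ],
}
iam.put_user_policy(UserName=USER, PolicyName='athena-s3-access',
                    PolicyDocument=json.dumps(s3_policy))

# The access/secret key pair to plug into the BI tools below
key = iam.create_access_key(UserName=USER)['AccessKey']
print(key['AccessKeyId'], key['SecretAccessKey'])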

JDBC

The JDBC driver (com.simba.athena.jdbc.Driver) can be downloaded from the AWS Athena documentation.

The JDBC URL is constructed as follows:

jdbc:awsathena://AwsRegion=<aws-region>;User=<aws-access-key>;Password=<aws-secret-key>;S3OutputLocation=s3://athena.out;

Here’s a sample Java program that shows it in action:

import java.sql.Connection;
import java.sql.DriverManager;
 
public class Main {
  public static void main(String[] args) throws Throwable {
    Class.forName("com.simba.athena.jdbc.Driver");
    String accessKey = "...";
    String secretKey = "...";
    String bucket = "athena.out";
    String url = "jdbc:awsathena://AwsRegion=us-east-1;User=" + accessKey + ";Password=" + secretKey
        + ";S3OutputLocation=s3://" + bucket +”;";
    Connection connection = DriverManager.getConnection(url);
    System.out.println("Successfully connected to\n\t" + url);
  }
}

Example using JDBC: DbVisualizer

  1. If you haven’t already, download the JDBC driver to some folder.
  2. Open the Driver Manager (Tools > Driver Manager).
  3. Press the green + to create a new driver.
  4. Press the folder icon on the right, browse to the folder where you saved the JDBC driver, and select it.
  5. Leave the URL Format field blank and pick com.simba.athena.jdbc.Driver for Driver Class.
  6. Close the Driver Manager, and let’s create a Connection.
  7. We’ll use the “No Wizard” option. Pick Athena from the dropdown in the Driver (JDBC) field and enter the JDBC URL from above in the Database URL field.
  8. Press “Connect” and observe DbVisualizer read the metadata information from Athena (well, Glue, really), including tables and views.

ODBC (on OSX)

  1. Download and run the ODBC driver installer.
  2. Create or edit /Library/ODBC/odbcinst.ini to add the following information:
    [ODBC Drivers]
    Simba Athena ODBC Driver=Installed
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib

    If the odbcinst.ini file already has entries, put new entries into the appropriate sections; e.g., if it was
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so

    Then it becomes
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    Simba Athena ODBC Driver=Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
  3. Create or edit, in a similar fashion, /Library/ODBC/odbc.ini to include the following information:
    [AthenaDSN]
    Driver=/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
    AwsRegion=us-east-1
    S3OutputLocation=s3://athena.out
    AuthenticationType=IAM Credentials
    UID=AWS_ACCESS_KEY
    PWD=AWS_SECRET_KEY
  4. If you wish to test, download and run ODBC Manager. You should see that it successfully recognizes the DSN.
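
Another quick way to test the DSN is from Python with the pyodbc package; a minimal sketch, assuming pyodbc is installed and using the wins table from the previous post:

import pyodbc

# Connect through the DSN configured in /Library/ODBC/odbc.ini;
# the credentials can also live in the DSN instead of the connection string.
conn = pyodbc.connect('DSN=AthenaDSN;UID=AWS_ACCESS_KEY;PWD=AWS_SECRET_KEY',
                      autocommit=True)
cursor = conn.cursor()

# wins is one of the log tables discovered by the Glue crawler
cursor.execute('SELECT COUNT(*) FROM wins')
print(cursor.fetchone()[0])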

Example using ODBC: Excel

  1. Switch to the Data tab, and under New Database Query select From Database.
  2. In the iODBC Data Source Chooser window, select the AthenaDSN we configured above and hit OK.
  3. Annoyingly, despite having configured it, you will be asked for credentials again. Enter the access and secret key.
  4. You should see a Microsoft Query window.

Success!


Ad-hoc querying on AWS: Lambda, Glue, Athena

Introduction

If you give different engineers the same problem, they will usually produce reasonably similar solutions (mutatis mutandis). For example, when I first came across a reference implementation of an RTB platform using AWS, I was amused by how close it was to what we had implemented in one of my previous projects (OpenRTB).

So it was not much of a surprise that the next RTB system used a similar pattern: logs are written to files, pushed to S3, then aggregated in Hadoop, from where the reports are run.

But there were a few problems in the way… 

Log partitioning

The current log partitioning in S3 is by server ID. This is really useful for debugging, and is fine for some aggregations, but it is not great for analytics – it is hard to narrow things down by date, which results in large scans, and that makes joins even harder. Large scans in tools like Athena also translate into larger bills. In short, Hive-like partitioning would be better.

To that end, I’ve created a Lambda function, repartition, which is triggered when a new log file is uploaded to the s3://logbucket bucket:

import boto3
from gzip import GzipFile
from io import BytesIO
import json
import urllib.parse

s3 = boto3.client('s3')

SUFFIX = '.txt.gz'

V = "v8"

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        s3obj = s3.get_object(Bucket=bucket, Key=key)
        src = { 'Bucket': bucket, 'Key': key }
        (node, orig_name) = key.split("/")
        (_, _, node_id) = node.split("_")
        name = orig_name.replace(SUFFIX, "")
        (evt, dt0, hhmmss) = name.split("-")
        hr = hhmmss[0:2]
        # date-hour
        dthr = f"year=20{dt0[0:2]}/month={dt0[2:4]}/day={dt0[4:6]}/hour={hr}"
        
        schema = f"{V}/{evt}/{dthr}"
        dest = f"{schema}/{name}-{node_id}{SUFFIX}"
        print(f"Copying {key} to {dest}")
        s3.copy_object(Bucket=bucket, Key=dest, CopySource=src)

        return "OK"
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

if __name__ == "__main__":
    # Wrapper to run from CLI for now
    s3entry = {'bucket' : {'name' : 'logbucket'},
               'object' : {'key'  : 'server/requests-200420-003740.txt.gz'}}
    event = {'Records' : [{'s3' : s3entry}] }
    lambda_handler(event, None)
    

The log is then copied to a new path created under the v8 prefix, with the following pattern:

<event>/year=<year>/month=<month>/day=<day>/hour=<hour>/<filename>-<nodeID>. For example, 

s3://logbucket/server1234/wins-200421-000003.txt.gz

is copied to 

s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz

(The “v8” prefix is there because I have arrived at this schema having tried several versions — 7, to be exact).

What about storage cost?

  • An additional benefit of using date-based partitioning is that we can easily automate changing the storage class to Glacier for folders older than some specified time, which will save S3 storage costs for the duplicated data (a boto3 sketch follows this list).
  • In the cloud, storage costs are not the thing to worry about; outgoing traffic and compute are where the real costs are. So move the compute to the data, not the other way around.
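
Here is a minimal boto3 sketch of such a lifecycle rule; the rule ID and the 90-day threshold are arbitrary choices for illustration.

import boto3

s3 = boto3.client('s3')

# Move the repartitioned copies under the v8/ prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket='logbucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-v8-logs',
            'Filter': {'Prefix': 'v8/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
        }]
    })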

NOTE: Partitions do not equal timestamps

Partitioning is based on the file name. Records inside a file may have timestamps whose hour is one greater or one less than the partition hour (events logged near the top of the hour can land in the previous or the next file). Partitions are there to reduce the number of scanned records, so take care when querying not to assume that the timestamps under year=2020/month=04/day=21/hour=00 all fall in the 0th hour of 2020-04-21.
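
For example, here is a small sketch of building a partition predicate that covers a target hour plus its neighbours; the record-level filtering would still be done on the log's own timestamp column, whose name varies by table.

from datetime import datetime, timedelta

def partition_filter(target_hour: datetime) -> str:
    """Build an Athena WHERE fragment that scans the target hour plus the
    two neighbouring partitions, since partition values come from file names
    and can be off by an hour relative to the record timestamps."""
    hours = [target_hour + timedelta(hours=d) for d in (-1, 0, 1)]
    clauses = [
        f"(year = '{h:%Y}' AND month = '{h:%m}' AND day = '{h:%d}' AND hour = '{h:%H}')"
        for h in hours
    ]
    return '(' + ' OR '.join(clauses) + ')'

print(partition_filter(datetime(2020, 4, 21, 0)))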

Discover metadata in Glue

Glue is an ETL service on AWS. One of its great features is crawlers, which attempt to glean metadata from the logs. This is really convenient because it saves us the tedious step of defining the metadata manually.

So we set up a crawler. For an explanation of how to do it, see the “Tutorial” link on the left-hand side of the AWS Glue console page.

However, it takes some time to get the configuration correct. 

  1. We want to exclude some logs because we know their format is not good for discovery (they are not straight-up JSON or CSV, etc., and, at the moment, custom SerDes are not supported by Athena) — but see below for exceptions. This is done in the “Data store” part of the crawler configuration.
  2. We want Glue to treat new files of the same type as different partitions of the same table, not to create new tables. For example, given the partitioning convention we created above, these two paths:
  • s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz
  • s3://logbucket/v8/wins/year=2020/month=04/day=22/hour=11/wins-200422-000004-4321.txt.gz

    should be treated as partitions of the table “wins”, not as two different tables. We do this in the “Output” section of the crawler configuration.

Once the crawler runs, we will see a list of tables created in Glue.

If we see tables that look like parts of the partitioned path (e.g., year=2020, or ending with _txt_gz), it means the crawler got confused while discovering their structure. We add those to the crawler’s exclusion list and create their metadata manually. Fortunately, there are not that many such logs, and we can exclude them one by one.

Of course, while the crawler can recognize the file structure, it doesn’t know what to name the fields, so we can go and name them manually. While this is a tedious process, we don’t have to do it all at once – just on an as-needed basis.

We will want to keep the crawler running hourly so that new partitions (which get created every hour) are picked up. (This can also be done manually from Athena – or Hive – by issuing the MSCK REPAIR TABLE command.)
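
The crawler definition and its schedule can also be scripted. Below is a hedged boto3 sketch of a roughly equivalent crawler; the crawler name, IAM role, Glue database name and exclusion patterns are placeholders, and the Configuration value corresponds to the console option that groups compatible files into a single table.

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='logs-crawler',                                     # placeholder
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',   # placeholder
    DatabaseName='logs',                                     # placeholder
    Targets={'S3Targets': [{
        'Path': 's3://logbucket/v8/',
        # Logs whose format confuses the crawler (see below) are excluded
        # and defined manually; the pattern here is a placeholder.
        'Exclusions': ['requests/**'],
    }]},
    # Run hourly so the new hour=... partitions get picked up.
    Schedule='cron(15 * * * ? *)',
    # Group compatible files into partitions of one table per event type
    # rather than creating a new table per path.
    Configuration='{"Version": 1.0, '
                  '"Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}',
)

# The crawler can also be kicked off on demand:
glue.start_crawler(Name='logs-crawler')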

First useful Athena query

Looking now at Athena, we see that the metadata from Glue is visible and we can run queries.
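
The console screenshots are not reproduced here, but as an illustration, here is a hedged sketch of submitting such a query programmatically with boto3; the Glue database name and output location are placeholders.

import time
import boto3

athena = boto3.client('athena')

# Count events per day for one month, using only the partition columns.
query = """
SELECT day, COUNT(*) AS cnt
FROM wins
WHERE year = '2020' AND month = '04'
GROUP BY day
ORDER BY day
"""

qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'logs'},
    ResultConfiguration={'OutputLocation': 's3://athena.out/'},
)['QueryExecutionId']

# Poll until the query finishes, then print the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']
    for row in rows:
        print([col.get('VarCharValue') for col in row['Data']])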

Woohoo! It works! I can do nice ad hoc queries. We’re done, right?

Almost. Unfortunately, for historical reasons, our logs are not always formatted ideally for this setup.

We identified two key cases:

  1. Mostly CSV files:
    • There are event prefixes preceding the event ID, even though the event itself is already defined by the log name; for example, bid:<BID_ID>:
      bid:0000-1111-2222-3333
    • A CSV field really contains two values. For example, a log that is comma-separated into two fields, a timestamp and a “message”, where the message includes a “Win: ” prefix before the bid ID and then – with no comma! – “price: ” followed by the price. Like so:
      04/21/2020 00:59:59.722,Win: a750a866-8b1c-49c9-8a30-34071167374e_200421-00__302 price:0.93
      However, what we want to join on is the ID. In these cases, we can use Athena views. For these two logs, respectively, we can use:
      CREATE OR REPLACE VIEW bids2 AS
      SELECT "year", "month", "day", "hour",
      "SUBSTRING"("bid_colon_bid_id", ("length"('bid:') + 1)) "bid_id"
      FROM bids

      and
      CREATE OR REPLACE VIEW wins2 AS
      SELECT "year", "month", "day", "hour",
      "SPLIT_PART"("message", ' ', 3) "bid_id"
      FROM wins

      Now joins can be done on the bid_id column, which makes for a more readable query.
  2. The other case is a log that has the following format: a timestamp, followed by a comma, followed by JSON (an OpenRTB request wrapped in one more layer of our own JSON that augments it with other data). This makes it neither CSV nor JSON, and the Glue crawler gets confused. The solution is to use a RegEx SerDe that splits each line into a timestamp column and a JSON column (a sketch of an equivalent table definition appears after this list).

    And then we can use Athena’s JSON functions to deal with the JSON column, for example, to see the distribution of requests by country:
     SELECT JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country') country,
       COUNT(*) cnt
     FROM requests
     GROUP BY JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country')
     ORDER BY cnt DESC
    Success! We can now use SQL to easily query our event logs.
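
For reference, since the screenshot of the RegEx SerDe setup is not reproduced here, below is a hedged sketch of an equivalent table definition for the timestamp-comma-JSON log from item 2; the timestamp column name, database name and exact regex are assumptions, and the statement can equally be run in the Athena console instead of through boto3.

import boto3

athena = boto3.client('athena')

# One regex group for the timestamp, one for everything after the first
# comma (the JSON payload queried above as the "request" column).
ddl = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS requests (
  ts      string,
  request string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '^([^,]*),(.*)$')
LOCATION 's3://logbucket/v8/requests/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'logs'},   # placeholder database
    ResultConfiguration={'OutputLocation': 's3://athena.out/'},
)
# New partitions still need to be loaded, e.g. with MSCK REPAIR TABLE requests.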


Monkey business


After reading this thread on the Joel on Software forum (http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=75607&ixReplies=51), I thought I’d put in my couple of bucks… This is a RANT!

It seems that the orientation in business is towards “monkey”
programmers — those who do not think, but do as they are
told. This is because management, apparently (and justifiably),
believes that at any given time it is easier to hire a hundred
monkeys (those are trained ones, that do not type randomly,
and so less than a million and less than infinite time will suffice,
but this is not a good analogy anyway), than a Shakespeare – or even Dumas
(with his own monkeys, so that’s another bad analogy, woe is me!)

As a result, there are (the list is by no means exhaustive; Java is
the language unless otherwise specified — I think Java has produced
more monkeys who think they are software engineers than anything
else — at least VB does not lend one an air of superiority):

  • …monkeys who would rather sharpen the
    carpal-syndrome-inducing skills of cutting and pasting the same
    thing over and over again, rather than learn something like sed or Perl or a
    similar tool —
    or, indeed, spend some effort finding out about the existence
    of such tools and their availability on the monkey platform of
    choice (read: Windows) — or even finding out what plugins are
    available for their lovely IDE.

    IBM, for example, provides a framework called
    EAD4J, Enterprise
    Application Development for Java (it is only available with
    purchase of IBM IGS consulting services). It includes components
    similar to Struts, log4j, etc.
    The framework is well designed, but here is a catch — because
    of its design, adding or changing a service requires changes to
    about 8 files. There are abstract processes, process factories,
    interfaces, factories, XML files with queries, files containing constants to
    look up these queries, etc., etc. It would really be nice if there was
    a simple way to manage it, plugging in your logic where
    some IDE plugin or script does the, well, monkey
    job. Otherwise it’s overdesigned.

    Now, there are simple plugins for the current IDE of choice, WSAD, that at
    least allow generating these
    standard files (if not managing them, which is also important —
    change one signature, and you have to change several
    files). These plugins are provided by IGS.
    But nooo, the monkeys here prefer to create all of this by hand. It’s
    a painful sight.

  • …macaques who cannot fathom how one
    could write a client-server application that does not communicate through
    XML requests embedded in HTTP, but – o, horror! – actually has its own
    application layer protocol.

  • …baboons who think that
    patterns
    are not merely possible (albeit very good) approaches to problems
    (and indeed generalizations of good approaches to common
    problems that have arisen), but the only way to solve problems,
    and that they must be copied straight from
    the book, or else it wouldn’t work. They wouldn’t know a pattern they
    haven’t read about if it bit them on that place their head is
    forever hidden in. If GoF didn’t write about it, it ain’t a pattern.

  • Ok, I am tired of enumerating primate species. I’ll
    just give an anecdote.

    I wrote a module used by several teams. Because of the ever-changing
    requirements, some methods and classes became
    useless. I gave a fair warning by email, then I gave a second one by
    marking them deprecated in the code. I noticed that the
    deprecated
    tags were periodically removed. I sent mail about this, and marked them
    deprecated again. And again. And again.

    A monkey who was the team leader of another team came complaining that
    I should remove it, because he cannot perform a build. Everyone else
    can,
    but he can’t, and so I should remove the single tag (that is probably
    more useful to the whole project than anything he’s ever
    produced). He cannot be bothered to find out how to make
    it work? Why can everyone else make it work? Oh, he’s using some Ant
    scripts? What? That’s an excuse? What the hell does that
    have to do with anything? Oh, he didn’t write those
    scripts. Well, write your own, or take them from those
    people for whom they work. Oh, you don’t have time? Well,
    I don’t have time to keep giving you warnings you just
    ignore, you twit.

    Screw you, I finally thought, the warning has been there for some
    time. I’ll just remove this stuff altogether.
    His build promptly crashed. “Not my problem – we talked about this
    over 5 weeks ago!”, I gloated, producing the emails from my
    appropriately named CYA folder.

    As Butch said, “that’s what he gets for fucking up my sport.”

In short, they are not
Joel’s kind of programmers,
to put it mildly. Monkeys see and monkeys do. They do not think. They
have been taught a way to do things, and it is beyond them to figure
out that there could be another way. I honestly do not think they
understand what a boolean is (I submit that in their mind there is an
if statement, and then there’s a boolean type)
when they write:


if (thingie.isOk()) {
    return true;
} else {
    return false;
}

Then someone they blindly trust (it must be an established authority,
like a book/magazine — only that approved by an already established
authority, because monkeys do not further their education on their
own — or a manager, or an instructor at a paid course) tells
them about a ternary operator. Now they write:


return thingie.isOk() ? true : false;

The above two examples are from actual production code.

Further, because monkeys do not think, they often reinvent the
wheel, badly. Which is also ironic, because they have been imbued with
all the right (and wrong) buzzwords, including “reuse”. I hesitate
to hazard a guess as to whether there is some meaning in their heads
they associate with this word, or is it just something they cry out
when playing free associations with their shrinks (“OO –
Encapsulation! Polymorphism! Reuse!”).

Here are some more anecdotes.

  • One programmer on a project wrote his own utilities to convert things
    from/to hex numbers, for crying out loud. Here is Java, the only thing
    he knows at all, and he can’t be bothered to think that maybe,
    just maybe, such a thing is a part of standard API.

  • This same monkey took several weeks to write a
    parser (for a very simple
    grammar, containing only certain expressions and operators such as
    ANDs and ORs). When I asked him why he didn’t use a
    parser generator (such as ANTLR, CUPS or JavaCC), he replied that
    he didn’t know any of them. Now, it is not a crime not to know a
    particular technology, but surely a programmer must be a) aware
    that there are such things as parser generators, and b)
    able to learn how to use one. Whether he lacked the understanding or the
    desire to learn, is this the kind of developer you want?

  • Background: We needed to create some scripts doing export from the
    database. The export was to be done under some specific
    conditions, which were to be specified in the queries
    (that is, only export dependent tables if their parent
    tables are eligible to be exported, etc.) The logic was
    only in SQL queries, the rest were just scripts passing
    these queries to the DB2 command line and logging everything.
    All of those were written by hand, with 80% of the time spent copying
    and pasting things, and then looking for places where the
    pasted things needed to be changed a bit (for example,
    some things are exported several times into different IXF
    files, because they are dependent on different
    things. These files need to be numbered sequentially, so
    the next one does not overwrite the other. What do monkeys do?
    Number them by hand. Great.)

    When I suggested automating things, in fact, automating
    from the first step – even before writing our own queries,
    using the metadata to generate the
    queries themselves, I was looked at as if I just escaped
    from the mental asylum.

    Monkey But you cannot just rely on metadata, there are also
    functional links which are not foreign keys.

    Me Why are they not foreign keys in the first place?

    Monkey Because they are functional.

    Me Stop using that word. Tell me why are they not foreign keys?

    Monkey Because they are nullable.

    Me A foreign key can be nullable! Why is it not a foreign key?
    OK, whatever, that’s our DBA’s problem… But there’s a convention for
    functional keys anyway (we know they all start with
    SFK_, by convention). I’ll use that.

    Two days pass. My script works. A week later, they have problems
    with their original scripts. My approach works,
    demonstrably. But ok, they want to keep doing it their
    way, fine. They ask for help with their way – those scripts, wrapping
    hand-made SQL queries (which are already being automatically
    generated, but I’ll hold off on that for now…)

    Monkey What are you doing?

    Me Writing a Perl script.

    Monkey But there is no Perl on Windows.

    Me See, I am sitting at a Windows machine and I have Perl.

    Monkey What is it for? I thought Perl was only for the Web?

    Me I am writing a script to generate your silly scripts
    from the small set of user input. The resulting files, which you
    are now doing BY HAND, are cluttered with repetitive stuff, such
    as error-handling code and file numbering, and it’s error-prone to do
    search and
    replace manually. So we’ll generate all these scripts using my script.

    Monkey But they don’t have Perl on their Windows.

    Me Who are “they”?

    Monkey The client?

    Me First of all, this is for the AIX machine. Second, this is
    not for them, we will just deliver the generated shell scripts, the
    Perl script is for us only.

    It takes several iterations for Monkey to get it.

    A day passes…

    Me Hey, where’s my Perl script I wrote to generate the import
    scripts?

    Monkey We have to have only shell scripts.

    Me Yes, I used that one to create those shell scripts, dammit!

    Monkey (sits writing these shell scripts again by hand. At the
    moment, manually changing some upper-case strings to lower-case)
    I
    removed it from CVS. They only want shell scripts on their machine.

    Me It wasn’t going on their machine! It’s only for us!!!

    Monkey Here, I changed these files already, you change the
    rest.

    Me (giving up) OK.

    Monkey Oh, and they have to be K-shell. Change them all to
    .ksh.

    Me Why do they have to be ksh? What’s wrong with sh? They are
    all very simple anyway, just call db2 import, check error status,
    that’s it.

    Monkey They have to be K shell. That’s what the DBA said.

    Me What the hell does the DBA have to do with it?

    Monkey He wants to be able to change them, and he doesn’t know sh,
    only ksh.

    Me Ok, fine. I suppose you’re right, echo is
    different in K-shell.

    Monkey misses the seething sarcasm. Of course.

    Monkey Right here.

    Me I don’t see them. What is this OAD_0035.ksh? Is that it?

    Monkey Yes.

    Me What does this mean? What do these numbers mean?

    Monkey That’s what they said they should be called.

    Me Who are “they”???

    Silence.

    Me OK, you have a script called OAD_0035.ksh calling
    OAD_0038.ksh, which in turn calls OAD_0038_1.ksh, OAD_0039_2.ksh, etc.
    Why are they called this? It’s hard to remember which one is which.

    Monkey Why do you want to know what it means?

    Me Because if I don’t know what it means, it’s much harder for
    me to look at the file and see what is supposed to be inside. Ah, I
    see
    you added the insightful comment inside each file with its meaningful
    name. Ah, I see also, you use that stupid name inside of it over and
    over again, to write to the logs, instead of just using $0. (deep
    breath). I’ll just create some symbolic links to them with meaningful
    names, so I know what’s going on…

    An hour later

    Me Where are my links?

    Monkey They only wanted files there that are named like
    OAD_0035, etc.

    Me What the hell do these numbers mean???

    Monkey I don’t know. For security.

    Me (pause) Who told you to do this?

    Monkey The client.

    Me The client is a company. Who have you met from the company?

    Monkey I don’t know. They said the client wants this.

    Me Who said? Where? When?

    Now I’m really curious. I turn with this question to
    others. Finally I come to the last monkey who knows.

    Monkey 5 Uh, the client has guidelines on what they are
    supposed to be called. It should start with OAD, and
    then underscore, then four characters.

    Me Why four characters?

    Monkey 5 For normalization.

    Me What normalization?! What can you possibly
    change in a
    luminous egg,
    I mean, normalize in a 20-line shell script?

    Monkey 5 So they can keep them consistent and do some things to
    all of them, regardless of what they are called.

    Me What can they possibly want to do with shell scripts? Rename
    them to some other numeric pattern? There isn’t even any method as to
    how they are named, it’s not like a certain number
    pattern means it’s dependent on the other. You just
    named them randomly…

    Curtain

    But hey, fire one, and the replacement is easy to find. That’s true.
    I suppose Henry Ford would be proud, but isn’t this a backward
    approach? You don’t need monkeys at all, most of this work can
    be automated.

    Maybe I need another line of work. 🙂