Here I will describe the Nanoput project, which comprises a large part of OpenDSP’s DMP (Data Management Platform). There are, of course, other pieces — the entire picture will be painted under the DMP tag.
The overall DMP stack is as follows:
- Nanoput proper: NGINX with Lua to handle incoming requests. I’ll confess that at the moment this is a bit of a pet (as in “pets, not cattle”): we roll our own build, which uses parts of OpenResty, and we are considering switching to OpenResty itself. But no matter; here I will present some features that can be achieved with this setup, and one instance is capable of handling all this.
- Redis for storing and manipulating user sets — ZSET is great
- MySQL for storing metadata — will be described in a separate post
- PHP/JS for a simple Web interface to define the said metadata
- Python for translating metadata into the configuration for NGINX
- AWS S3 for storing raw logs — pre-partitioned so that EMR can be used easily.
Conceptually, let’s consider the idea of an “event”. An impression, a conversion, a video tracking event, a site visit, etc., is an event; anything that fires a request to our Nanoput is one. You may recognize a similarity with Snowplow, and that is because we are solving a similar problem. Events may originate from:
- Exchanges or DMPs, in the form of an exchange-initiated cookie sync (see below).
- Regular user behavior: impressions, video-tracking events (in the case of video), and conversions.
Now, let us also consider the idea of a “user segment”. If you think about it, a segment is just a set of users. Thus, we may as well consider a user that produced a certain event as belonging to some segment. Segments may be explicitly asked for, such as “users we want to retarget” or “users who converted”, but there is no reason why any event cannot be defined as a segment-defining one.
Segments, here, are a special case of the data collections concept discussed in a different post.
Given that, we can now dive into the Nanoput implementation.
General data acquisition idea
Here, we simply leverage basic NGINX functionality, namely logging. To that end, we split the main config file into included sections that deal with log format, location, and behavior.
Static data acquisition URLs
By “static”, here we mean common use cases that are just part of Nanoput (hence the “man” subdirectory you will notice in examples; just like the Unix man command, it stands for “manual”). Here we have:
- Site events (essentially, an extension of the retargeting concept).
- Standard event tracking, by which we mean standard events that happen in the ad world.
Notice that we also augment the information available from NGINX (HTTP headers, etc.) with geo data from the GeoIP module and with user-agent, device, and OS information.
Dynamic (metadata-driven) data acquisition URLs
Dynamic data acquisition works simply: a process reads the metadata table and creates appropriate entries in the NGINX configs that define log format and location and behavior.
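As a minimal sketch of that translation step, the following Python turns a few metadata rows into NGINX `log_format` directives and `location` blocks. The row fields (`name`, `path`, `params`), the `/dyn/` URL prefix, and the log directory are assumptions made for illustration, not Nanoput's actual schema; `$time_iso8601`, `$remote_addr`, `$arg_<name>`, and `empty_gif` are standard NGINX features.

```python
import re

# Hypothetical metadata rows; in practice these would be read from the metadata table.
EVENTS = [
    {"name": "signup", "path": "/dyn/signup", "params": ["uid", "campaign"]},
    {"name": "video_start", "path": "/dyn/video_start", "params": ["uid", "video_id"]},
]

# Only allow identifiers that are safe to paste into a config file.
SAFE = re.compile(r"[A-Za-z0-9_]+")


def render_log_format(event):
    """Return a log_format directive (http context) for one event."""
    fields = ["$time_iso8601", "$remote_addr"]
    # $arg_<name> is the NGINX variable holding query-string argument <name>.
    fields += ["$arg_" + p for p in event["params"]]
    return "log_format {0} '{1}';".format(event["name"], " ".join(fields))


def render_location(event, log_dir="/var/log/nginx"):
    """Return a location block (server context) that logs the event and serves a 1x1 GIF."""
    return (
        "location = {path} {{\n"
        "    access_log {log_dir}/{name}.log {name};\n"
        "    empty_gif;\n"
        "}}"
    ).format(path=event["path"], log_dir=log_dir, name=event["name"])


def render_all(events):
    """Validate identifiers, then return (log_format text, location text)."""
    for e in events:
        for token in [e["name"]] + e["params"]:
            if not SAFE.fullmatch(token):
                raise ValueError("unsafe identifier: %r" % token)
    formats = "\n".join(render_log_format(e) for e in events)
    locations = "\n\n".join(render_location(e) for e in events)
    return formats, locations


if __name__ == "__main__":
    formats, locations = render_all(EVENTS)
    print(formats)
    print()
    print(locations)
```

The two generated fragments would be written to separate files and pulled into the `http` and `server` contexts with `include`, followed by an NGINX reload.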
On every “event”, a script runs that uses Redis’s awesome Sorted Set functionality, inserting each user twice. The key idea here, again, is a variation on dealing with data gravity concerns by just duplicating storage. We create two sorted sets for each key: the “score” is the first time we have seen the user in one set and the last time in the other. The reasoning for this is that:
- First-seen: we can write batch scripts to get rid of users we have last seen over X days ago (expiring targeting).
- Last-seen: helps us with conversion attribution (yes, this assumes naive last-click attribution or variants).
Duplication is not just for every user — it is for every set. The key here is the set (or segment) name, and the value is the set of users.
An added benefit of this is that new segments can be created by using various Redis set operations (union, intersection) easily.
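The double insertion and the set operations can be sketched with a small in-memory model. This is not the Nanoput script itself: the class, the `:first` and `:last` key suffixes, and the method names are hypothetical, and plain dicts stand in for Redis sorted sets so the example runs without a server. The comments name the Redis command each step would correspond to.

```python
import time


class SegmentStore:
    """In-memory stand-in for two Redis sorted sets per segment."""

    def __init__(self):
        # key -> {member: score}, mirroring a Redis ZSET
        self.zsets = {}

    def record(self, segment, user_id, now=None):
        """Insert the user into both the first-seen and last-seen sets."""
        now = time.time() if now is None else now
        first = self.zsets.setdefault(segment + ":first", {})
        last = self.zsets.setdefault(segment + ":last", {})
        # Redis: ZADD <segment>:first NX <now> <user_id>  (never update an existing member)
        first.setdefault(user_id, now)
        # Redis: ZADD <segment>:last <now> <user_id>      (overwrite with the newest score)
        last[user_id] = now

    def trim_older_than(self, key, cutoff):
        """Drop members whose score is below cutoff; return how many were removed."""
        # Redis: ZREMRANGEBYSCORE <key> -inf (<cutoff>
        zset = self.zsets.get(key, {})
        stale = [m for m, s in zset.items() if s < cutoff]
        for m in stale:
            del zset[m]
        return len(stale)

    def intersect(self, dest, key_a, key_b):
        """Create a new segment from users present in both sets."""
        # Redis: ZINTERSTORE <dest> 2 <key_a> <key_b> AGGREGATE MAX
        a, b = self.zsets.get(key_a, {}), self.zsets.get(key_b, {})
        self.zsets[dest] = {m: max(a[m], b[m]) for m in a.keys() & b.keys()}
        return len(self.zsets[dest])


store = SegmentStore()
store.record("retarget", "u1", now=100)
store.record("retarget", "u1", now=500)   # first-seen stays 100, last-seen becomes 500
store.record("retarget", "u2", now=200)
store.record("converted", "u1", now=600)
store.intersect("retarget_and_converted", "retarget:last", "converted:last")
```

Which of the two sets a batch job trims, and with what cutoff, is left to the caller here.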
Some useful shortcuts for a DMP
- Getting OS/browser info without necessarily using WURFL (though that can easily be fronted by NGINX too, actually).
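To illustrate the kind of coarse classification that is possible without a device database, here is a sketch using ordered regular-expression rules on the User-Agent header. The rule lists and labels are assumptions for illustration and are far from exhaustive; in Nanoput the equivalent logic would live in the NGINX layer rather than in Python.

```python
import re

# Ordered rules: the first match wins. Order matters because, for example,
# iOS user agents contain "like Mac OS X", Android ones contain "Linux",
# and Chrome ones contain "Safari".
OS_RULES = [
    ("iOS", re.compile(r"iPhone|iPad|iPod")),
    ("Android", re.compile(r"Android")),
    ("Windows", re.compile(r"Windows NT")),
    ("macOS", re.compile(r"Mac OS X|Macintosh")),
    ("Linux", re.compile(r"Linux")),
]
BROWSER_RULES = [
    ("Edge", re.compile(r"Edg(e|A|iOS)?/")),
    ("Opera", re.compile(r"OPR/|Opera")),
    ("Chrome", re.compile(r"Chrome/|CriOS/")),
    ("Firefox", re.compile(r"Firefox/|FxiOS/")),
    ("Safari", re.compile(r"Safari/")),
]


def classify(user_agent):
    """Return a coarse (os, browser) pair; "Other" when no rule matches."""
    def first_match(rules):
        for label, pattern in rules:
            if pattern.search(user_agent):
                return label
        return "Other"

    return first_match(OS_RULES), first_match(BROWSER_RULES)


ua = ("Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36")
os_name, browser = classify(ua)
```

A rule table like this is cheap to evaluate per request, at the cost of missing the device-level detail a database such as WURFL provides.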
Exchange cookie sync
In the display world, there is a need for cookie syncing between a DSP and a third-party DMP or an exchange/SSP. The sync can be exchange-, DMP-, or DSP-initiated, or go in both directions. Some exchanges may allow the redirect chain to proceed further, some may not. Nanoput provides this functionality for the exchanges we deal with, as well as a template for doing it for other partners, all at the speed that NGINX provides. Here are the moving parts:
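As a rough sketch of what an exchange-initiated sync involves, the function below takes the query arguments and cookies of an incoming sync request, reuses or assigns our own user ID, records the mapping between the exchange's ID and ours, and optionally continues the redirect chain. The parameter names (`exchange`, `xid`, `redir`, `partner_uid`), the `uid` cookie, and the allowed-host list are all hypothetical; each real partner defines its own.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allow-list: only continue the redirect chain to known partners.
ALLOWED_REDIRECT_HOSTS = {"sync.example.com"}


def add_param(url, key, value):
    """Append key=value to the query string of url."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(params)))


def handle_sync(query, cookies, new_id=lambda: uuid.uuid4().hex):
    """Process one exchange-initiated sync request.

    query   -- dict of query-string arguments from the exchange's pixel call
    cookies -- dict of cookies the browser sent to our domain
    Returns (our_uid, mapping, redirect_url); mapping and redirect_url may be None.
    """
    # Reuse our existing cookie, or mint a new ID for a first-time visitor.
    our_uid = cookies.get("uid") or new_id()

    # The (exchange, exchange user ID, our user ID) triple is what gets stored.
    mapping = None
    if query.get("xid"):
        mapping = (query.get("exchange", "unknown"), query["xid"], our_uid)

    # Continue the chain only if a redirect was requested and the host is known.
    redirect_url = None
    target = query.get("redir")
    if target and urlsplit(target).hostname in ALLOWED_REDIRECT_HOSTS:
        redirect_url = add_param(target, "partner_uid", our_uid)

    return our_uid, mapping, redirect_url


result = handle_sync(
    {"exchange": "exA", "xid": "abc", "redir": "https://sync.example.com/cb?x=1"},
    {"uid": "u42"},
)
```

When no redirect applies, the response would simply be a pixel that also sets the `uid` cookie.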
Storing for further analysis
Raw logs, formatted as above, are uploaded to S3. Notice that they are stored twice, with different partitioning schemes. This is one of the key ideas in Nanoput: storage is cheap, so we duplicate the data this way and then use one or the other partitioning scheme depending on the use case:
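To illustrate the idea only, the following sketch derives two S3 keys for the same log file, one partitioned by time first and one by event first. Both layouts, the `logs` prefix, and the `by_time`/`by_event` names are hypothetical and are not Nanoput's actual schemes; the `key=value` directory naming follows the Hive partitioning convention, which tools commonly run on EMR can read.

```python
from datetime import datetime, timezone


def s3_keys(event, ts, filename, prefix="logs"):
    """Return two object keys for one log file under different partitionings."""
    day = ts.strftime("%Y-%m-%d")
    hour = ts.strftime("%H")
    # Time-first layout: convenient for "everything that happened in this hour".
    by_time = "{p}/by_time/dt={d}/hour={h}/event={e}/{f}".format(
        p=prefix, d=day, h=hour, e=event, f=filename
    )
    # Event-first layout: convenient for "all history of one event type".
    by_event = "{p}/by_event/event={e}/dt={d}/hour={h}/{f}".format(
        p=prefix, d=day, h=hour, e=event, f=filename
    )
    return by_time, by_event


ts = datetime(2015, 3, 7, 14, 5, tzinfo=timezone.utc)
by_time, by_event = s3_keys("impression", ts, "access.log.gz")
```

The uploader would then write the same bytes to both keys, trading storage for cheaper reads on each access pattern.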