Here I will describe the Nanoput project, which comprises a large part of OpenDSP’s metadata-driven DMP (Data Management Platform). There are, of course, other pieces; the entire picture will be painted under the DMP tag.
The overall DMP stack is as follows:
- Nanoput proper: NGINX with Lua to handle the incoming event requests. I’ll confess that at the moment this is a bit of a pet (as in, not cattle). We are considering using OpenResty instead of rolling our own setup, which already uses parts of OpenResty. But no matter; here I will present some features that can be achieved with this setup, and a single instance is capable of handling all this.
- Redis for storing and manipulating user sets — ZSET is great
- MySQL for storing metadata — will be described in a separate post
- PHP/JS for a simple Web interface to define the said metadata
- Python for translating metadata into the configuration for NGINX
- AWS S3 for storing raw logs — pre-partitioned so that EMR can be used easily.
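To make the metadata-to-NGINX step concrete, here is a minimal sketch of what such a translation could look like. The metadata field names, paths, and log format names are made up for illustration; Nanoput’s actual schema lives in MySQL and is described separately.

```python
# Hypothetical sketch: turning event metadata (as it might be stored in MySQL)
# into NGINX location blocks. Field names and paths are illustrative only.

EVENTS = [
    {"name": "imp", "path": "/imp", "log": "imp_log"},
    {"name": "conv", "path": "/conv", "log": "conv_log"},
]

LOCATION_TEMPLATE = """\
location {path} {{
    # Return a 1x1 transparent GIF so the tracking request completes quickly.
    empty_gif;
    access_log /var/log/nanoput/{name}.log {log};
}}
"""

def render_nginx_config(events):
    """Render one NGINX location block per metadata-defined event."""
    return "\n".join(LOCATION_TEMPLATE.format(**e) for e in events)

print(render_nginx_config(EVENTS))
```

A script like this can regenerate the NGINX configuration whenever the metadata changes, followed by a config reload.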
Conceptually, let’s consider the idea of an “event”. An impression, a conversion, a video-tracking event, a site visit, etc., is an event — anything that fires a request to our Nanoput is one. You may recognize a similarity with Snowplow, and that is because we are solving a similar problem. Events come from:
- Exchanges or DMPs, as in an exchange-initiated cookie sync: see below.
- Regular user behavior: impressions; in the case of video, video-tracking events; and conversions.
Now, let us also consider the idea of a “user segment”. If you think about it, a segment is just a set of users. Thus, we may as well consider a user that produced a certain event as belonging to some segment. These may be explicitly asked for, such as “users we want to retarget” or “users who converted”, etc. But there is no reason why any event cannot be defined as a segment-defining one.
Segments, here, are a special case of the data collections concept discussed in a different post.
Given that, we can now dive into the Nanoput implementation.
General data acquisition idea
Static data acquisition URLs
By “static”, here, we mean common use cases that are just part of Nanoput (hence the “man” subdirectory you will notice in examples, which, like the Unix man command, stands for “manual”). Here we have:
- Site events (essentially, an extension of the retargeting concept).
- Standard event tracking, by which we mean the standard events that happen in the ad world.
Dynamic (metadata-driven) data acquisition URLs
On every “event”, a script records the user in Redis. We use Redis’s awesome Sorted Set functionality here, inserting each entry twice. The key idea, again, is a variation on dealing with data-gravity concerns by simply duplicating storage: we create two sorted sets for each key, with the “score” being, respectively, the first and the last time we have seen the user. The reasoning for this is that:
- First-seen: we can write batch scripts to get rid of users we last saw over X days ago (expiring targeting).
- Last-seen: helps us with conversion attribution (yes, this assumes naive last-click attribution or variants).
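The double bookkeeping above can be sketched as follows, with plain Python dicts standing in for the Redis sorted sets. In Redis terms, first-seen is a ZADD with the NX flag (the score is only set for new members), last-seen is a plain ZADD (the score is always updated), and the expiry scan corresponds to ZREMRANGEBYSCORE with the cutoff as the upper bound. Segment and user names here are illustrative.

```python
import time

# Plain-dict stand-in for the two Redis ZSETs kept per segment key.
first_seen = {}  # segment -> {user_id: timestamp}; Redis: ZADD key NX ts user
last_seen = {}   # segment -> {user_id: timestamp}; Redis: ZADD key ts user

def record_event(segment, user_id, ts=None):
    """Record one event: set first-seen only once, always update last-seen."""
    ts = ts if ts is not None else time.time()
    first_seen.setdefault(segment, {}).setdefault(user_id, ts)  # NX semantics
    last_seen.setdefault(segment, {})[user_id] = ts

def expire_old(segment, cutoff_ts):
    """Drop users first seen before the cutoff (expiring targeting)."""
    seen = first_seen.get(segment, {})
    expired = [u for u, ts in seen.items() if ts < cutoff_ts]
    for u in expired:
        seen.pop(u)
        last_seen.get(segment, {}).pop(u, None)
    return expired

record_event("retarget", "user1", ts=100)
record_event("retarget", "user1", ts=200)  # first-seen stays at 100
```

A nightly batch job can then call the expiry step per segment, exactly as the first-seen bullet above describes.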
Duplication is not just for every user — it is for every set. The key here is the set (or segment) name, and the value is the set of users.
An added benefit of this is that new segments can be created by using various Redis set operations (union, intersection) easily.
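As a quick illustration of deriving segments this way (with plain Python sets standing in for Redis keys; in Redis, union and intersection map to commands such as SUNIONSTORE and SINTERSTORE, or their ZSET counterparts), where the segment names are made up:

```python
# Two existing segments, keyed by segment name in Redis; here, plain sets.
converted = {"u1", "u2", "u3"}
visited_site = {"u2", "u3", "u4"}

# Users who visited but never converted: a natural retargeting segment.
retarget = visited_site - converted
# Users who both visited and converted.
high_intent = visited_site & converted
```

The derived sets can be stored back under new segment names, so new segments cost one Redis command to build.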
Some useful shortcuts for a DMP
- Getting OS/browser info without necessarily using WURFL (though that can easily be fronted by NGINX too, actually).
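A naive sketch of the kind of user-agent sniffing NGINX can do inline on $http_user_agent (via map or Lua), without a full WURFL lookup. The patterns below are illustrative and far from exhaustive; note that order matters, since e.g. Android user agents also contain “Linux”.

```python
import re

# Ordered, illustrative OS patterns; first match wins.
OS_PATTERNS = [
    (re.compile(r"Windows NT"), "windows"),
    (re.compile(r"Android"), "android"),      # must precede the Linux pattern
    (re.compile(r"iPhone|iPad"), "ios"),
    (re.compile(r"Mac OS X"), "macos"),
    (re.compile(r"Linux"), "linux"),
]

def sniff_os(user_agent):
    """Return a coarse OS family for a User-Agent string."""
    for pattern, os_name in OS_PATTERNS:
        if pattern.search(user_agent):
            return os_name
    return "unknown"
```

This trades WURFL’s accuracy for zero lookup cost, which is often enough for coarse OS/browser segmentation.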
Exchange cookie sync
In the display world, there is a need for cookie syncing between a DSP and a third-party DMP or an exchange/SSP, and the sync can be exchange-, DMP- or DSP-initiated, or both. Some exchanges may allow the redirect chain to proceed further; some may not. Nanoput provides this functionality for the exchanges we deal with, as well as a template for doing it for other partners, at the speed that NGINX provides. Here are the moving parts:
- On a partner (SSP/exchange or DMP)-initiated sync, the /man/xchin URL calls the xchin.lua script, which, depending on the known partner policy, responds with a proper redirect (coded specifically to ensure maximum redirects). You may notice that this could be initiated by the DSP itself, and you would not be wrong.
- The DSP itself may also initiate a cookie sync; this is controlled by the xchout.lua script.
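To make the DSP-initiated (xchout-style) direction concrete, here is a hypothetical sketch of building the partner redirect: substitute our user ID into a per-partner sync URL template, then issue the redirect. The partner name, template syntax, and domain are made up for illustration.

```python
# Illustrative per-partner sync URL templates; real ones come from partner
# integration docs. The .test domain is a reserved, non-routable TLD.
PARTNERS = {
    "examplessp": "https://sync.examplessp.test/match?dsp_uid={uid}",
}

def sync_redirect_url(partner, our_uid):
    """Build the 302 Location for a DSP-initiated sync, or None if unknown."""
    template = PARTNERS.get(partner)
    if template is None:
        return None
    return template.format(uid=our_uid)
```

In the Lua script, the equivalent lookup can be table-driven from the same metadata, so adding a partner is a data change rather than a code change.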
Storing for further analysis
Raw logs, formatted as above, are uploaded to S3. Notice that they are stored twice, with different partitioning schemes. This is one of the key ideas in Nanoput: storage is cheap, so we duplicate the storage and then use one or the other partitioning scheme depending on the use case:
- Partitioned by date — useful for most internally-provided reporting
- Partitioned by user — here, it’s important to note that this is a multi-tenant system; in this context, a “user” is a client/customer. This partitioning is useful for giving customers the ability to run their own custom queries. See also the notes on data duplication and data gravity.
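The double-write layout can be sketched as follows: the same log object lands under two prefixes, one partitioned by date and one by client. The bucket and prefix names are illustrative, not Nanoput’s actual layout.

```python
from datetime import date

def s3_keys(bucket, client, event_type, dt, filename):
    """Return both S3 keys for one log object: by-date and by-client."""
    date_part = dt.strftime("%Y/%m/%d")
    return [
        f"s3://{bucket}/by-date/{date_part}/{event_type}/{filename}",
        f"s3://{bucket}/by-client/{client}/{event_type}/{date_part}/{filename}",
    ]

keys = s3_keys("logs", "acme", "imp", date(2016, 1, 2), "part-0.gz")
```

Because each prefix matches a Hive-style partition path, EMR jobs can prune by date for internal reporting or by client for tenant-scoped queries without scanning the other layout.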
Every marketer, it seems, wants to participate in real-time bidding (RTB). But what is it that they really want?
They want an ability to price (price, not target!) a particular impression in real-time. Based on the secret-sauce business logic and data science. Fair enough.
But that secret sauce, part of their core competence, is just the tip of the iceberg — and the submerged part is all that is required to keep that tip above water. To wit:
- Designing, developing, testing and maintaining actual code for the UI for targeting, the bidder for bidding, reporting, and data management
- Scaling and deploying such code on some infrastructure (own data center, or clouds like AWS, GCE, Azure), etc.
- Integrating with all exchanges of interest, including the following steps:
  - Code: passing functional tests (understanding the exchange’s requirements for parsing the request and sending the response)
  - Infrastructure: ensuring the response is sent to the exchange within the double-digit-millisecond limit
  - Scaling: as above, but under real load (hundreds of thousands of queries per second)
  - Business: paperwork to ensure a seat on the exchange, including credit agreements when necessary
  - Operations: ongoing monitoring of the operations, including technical (increased latency) and business (low fill level, high disapproval level) concerns, whether these concerns are raised by clients or exchange partners or, ideally, addressed pro-actively internally.
None of which is their core competence. We propose to address the underwater part. It’ll be exciting.
Enter OpenDSP. We got something cool coming up here. Stay tuned.