Here I will discuss a metadata-driven data collection platform.
Here I will describe the Nanoput project, which comprises a large part of OpenDSP’s DMP (Data Management Platform). There are, of course, other pieces — the entire picture will be painted under the DMP tag.
The overall DMP stack is as follows:
- Nanoput proper: NGINX with Lua to handle incoming requests. I’ll confess that at the moment this is a bit of a pet (as in not cattle). We are considering switching to stock OpenResty instead of our own build, which already uses parts of OpenResty. No matter: here I will present some features that can be achieved with this setup, and a single instance is capable of handling all of them.
- Redis for storing and manipulating user sets — ZSET is great
- MySQL for storing metadata — will be described in a separate post
- PHP/JS for a simple Web interface to define the said metadata
- Python for translating metadata into the configuration for NGINX
- AWS S3 for storing raw logs — pre-partitioned so that EMR can be used easily.
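To make the "Python translates metadata into NGINX configuration" step concrete, here is a minimal sketch. The `EventDef` shape, the paths, and the `$segment` variable are assumptions made for illustration, not the actual Nanoput schema; `content_by_lua_file` is the lua-nginx-module directive for handing a request to a Lua script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventDef:
    """One metadata row describing a collection endpoint (hypothetical shape)."""
    path: str      # URL path that fires the event
    segment: str   # name of the segment (set) this event feeds
    handler: str   # Lua script that processes the request


def render_location(event: EventDef) -> str:
    """Render a single NGINX location block for one event definition."""
    return (
        f"location = {event.path} {{\n"
        f"    set $segment '{event.segment}';\n"
        f"    content_by_lua_file {event.handler};\n"
        f"}}\n"
    )


def render_config(events: list) -> str:
    """Join location blocks in a stable (sorted by path) order."""
    ordered = sorted(events, key=lambda e: e.path)
    return "\n".join(render_location(e) for e in ordered)


if __name__ == "__main__":
    rows = [
        EventDef("/evt/visit", "site_visitors", "/etc/nginx/lua/event.lua"),
        EventDef("/evt/convert", "converters", "/etc/nginx/lua/event.lua"),
    ]
    print(render_config(rows))
```

Sorting by path keeps the generated file stable between runs, so a regenerated configuration only differs where the metadata actually changed.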
Conceptually, let’s consider the idea of an “event”. An impression, a conversion, a video tracking event, a site visit, etc., is an event; anything that fires a request to our Nanoput is. You may recognize a similarity with Snowplow, and that is because we are solving a similar problem. Events can come from:
- Exchanges or DMPs, as an exchange-initiated cookie-sync: see below.
- Regular user behavior: impressions; video-tracking events, in the case of video; conversions.
Now, let us also consider the idea of a “user segment”. If you think about it, a segment is just a set of users. Thus, we may as well consider a user that produced a certain event as belonging to some segment. These may be explicitly asked for, such as “users we want to retarget”, or “users who converted”, etc. But there is no reason why any event cannot be defined as a segment-defining one.
Segments, here, are a special case of the data collections concept discussed in a different post.
Given that, we can now dive into the Nanoput implementation.
General data acquisition idea
Static data acquisition URLs
By “static”, we mean common use cases that are simply part of Nanoput (hence the “man” subdirectory you will notice in examples; like the Unix man command, it stands for “manual”). Here we have:
- Site events (essentially, those are an extension of retargeting concept).
- Standard event tracking, by which we mean standard events that happen in the ad world.
Dynamic (metadata-driven) data acquisition URLs
On every “event”, a script runs. Here we use Redis’s excellent Sorted Set functionality, inserting things twice. The key idea, again, is a variation on dealing with data gravity concerns by simply duplicating storage. We create two sorted sets for each key: in one, the “score” is the first time we have seen the user; in the other, it is the last time. The reasoning is as follows:
- First-seen: we can write batch scripts to get rid of users we have last seen over X days ago (expiring targeting).
- Last-seen: helps us with conversion attribution (yes, this assumes naive last-click attribution or variants).
Duplication is not just for every user — it is for every set. The key here is the set (or segment) name, and the value is the set of users.
An added benefit of this is that new segments can be created by using various Redis set operations (union, intersection) easily.
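The double-insertion idea can be sketched without a Redis server. The class below is an in-memory stand-in, and its names are hypothetical; against a real Redis each dictionary would be a sorted set, `record` would be two ZADD calls (with the NX option on the first-seen set so an existing score is not overwritten), and the set operations would map to commands such as ZUNIONSTORE and ZINTERSTORE.

```python
import time


class SegmentStore:
    """In-memory stand-in for two sorted sets per segment (illustrative only)."""

    def __init__(self):
        self.first_seen = {}  # segment -> {user_id: first timestamp}
        self.last_seen = {}   # segment -> {user_id: latest timestamp}

    def record(self, segment, user_id, now=None):
        """Insert the user twice: once per index."""
        now = time.time() if now is None else now
        # First-seen: keep the existing score if the user is already present.
        self.first_seen.setdefault(segment, {}).setdefault(user_id, now)
        # Last-seen: always overwrite with the newest timestamp.
        self.last_seen.setdefault(segment, {})[user_id] = now

    def seen_before(self, index, segment, cutoff):
        """Users whose score in the given index is older than cutoff."""
        return {u for u, ts in index.get(segment, {}).items() if ts < cutoff}

    def union(self, a, b):
        """New segment: users in either segment."""
        return set(self.last_seen.get(a, {})) | set(self.last_seen.get(b, {}))

    def intersect(self, a, b):
        """New segment: users in both segments."""
        return set(self.last_seen.get(a, {})) & set(self.last_seen.get(b, {}))


if __name__ == "__main__":
    store = SegmentStore()
    store.record("retarget", "u1", now=100)
    store.record("retarget", "u1", now=200)
    store.record("converted", "u1", now=150)
    print(store.intersect("retarget", "converted"))
```

A range query over scores (`seen_before` here) is what a batch cleanup script would run against whichever index it uses for expiry.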
Some useful shortcuts for a DMP
- Getting OS/browser info without necessarily using WURFL (though that can easily be fronted by NGINX too, actually).
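As a rough illustration of what "OS/browser info without WURFL" can mean, here is a coarse token-matching sketch. The token tables are assumptions and far from complete; in the actual setup this logic would sit in NGINX (for example in a `map` on `$http_user_agent`) or in Lua rather than in Python.

```python
# Order matters: Android user agents also contain "Linux", and iOS user
# agents contain "like Mac OS X", so the more specific tokens come first.
OS_TOKENS = [
    ("Windows NT", "Windows"),
    ("Android", "Android"),
    ("iPhone", "iOS"),
    ("iPad", "iOS"),
    ("Mac OS X", "macOS"),
    ("Linux", "Linux"),
]

# Chrome user agents also contain "Safari/", so Chrome is checked first.
BROWSER_TOKENS = [
    ("Edg/", "Edge"),
    ("OPR/", "Opera"),
    ("Firefox/", "Firefox"),
    ("Chrome/", "Chrome"),
    ("Safari/", "Safari"),
]


def first_match(user_agent, table):
    """Return the label of the first token found in the user agent."""
    for token, label in table:
        if token in user_agent:
            return label
    return "Unknown"


def classify(user_agent):
    """Return a coarse (os, browser) pair for a User-Agent header."""
    return first_match(user_agent, OS_TOKENS), first_match(user_agent, BROWSER_TOKENS)
```

This deliberately trades accuracy for speed: it gives device-family level answers, which is often enough for segment building, and anything finer can still be delegated to a full device database.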
Exchange cookie sync
In the display world, there is a need for cookie syncing between a DSP and a third-party DMP or an exchange/SSP. The sync can be initiated by the exchange, the DMP, or the DSP, or by more than one of them. Some exchanges allow the redirect chain to proceed further; some do not. Nanoput provides this functionality for the exchanges we deal with, as well as a template for doing it for other partners, at the speed that NGINX provides. Here are the moving parts:
- On a partner (SSP/exchange or DMP)-initiated sync, the /man/xchin URL calls the xchin.lua script, which, depending on the known partner policy, responds with a proper redirect (coded specifically to ensure maximum redirects). You may notice that this could be initiated by the DSP itself, and you would not be wrong.
- The DSP itself may initiate a cookie sync; this is controlled by the xchout.lua script.
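The partner-initiated case can be sketched as follows. This is not the logic of xchin.lua; the partner table, the `dsp_uid` parameter name, and the URLs are all made up for illustration. The sketch only shows the general shape: store the ID mapping, then redirect onward if the partner's policy allows the chain to continue.

```python
from urllib.parse import urlencode

# Hypothetical partner policies; real ones depend on each partner's spec.
PARTNERS = {
    "exch_a": {"allows_redirect": True, "sync_url": "https://sync.exch-a.example/match"},
    "exch_b": {"allows_redirect": False, "sync_url": None},
}


def handle_partner_sync(partner, partner_uid, our_uid, id_map):
    """Return (status, location) for a partner-initiated sync request."""
    policy = PARTNERS.get(partner)
    if policy is None:
        return 400, None  # unknown partner
    # Remember how the partner's user ID maps to our own cookie ID.
    id_map[(partner, partner_uid)] = our_uid
    if policy["allows_redirect"]:
        # Continue the chain: tell the partner what our ID for this user is.
        return 302, policy["sync_url"] + "?" + urlencode({"dsp_uid": our_uid})
    return 200, None  # chain ends here, e.g. with a 1x1 pixel response


if __name__ == "__main__":
    mapping = {}
    print(handle_partner_sync("exch_a", "p-123", "abc123", mapping))
```

Keeping the per-partner policy in a table is what makes the same handler usable as a template for new partners.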
Storing for further analysis
Raw logs, formatted as above, are uploaded to S3. Notice that they are stored twice, with different partitioning schemas. This is one of the key ideas in Nanoput: storage is cheap, so we duplicate the storage this way and then use one or the other partitioning schema depending on the use case:
- Partitioned by date — useful for most internally-provided reporting
- Partitioned by user: here, it’s important to see that it’s a multi-tenant system; in this context, “user” is a client/customer. This partitioning is useful for giving customers the ability to run their own custom queries. See also notes on data duplication and data gravity.
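A small sketch of the "store twice" idea: the same log file gets two object keys, one per partitioning scheme. The prefixes and key layout below are assumptions chosen for readability, not the actual bucket structure.

```python
from datetime import datetime, timezone


def log_keys(customer_id, event_time, filename):
    """Return the two S3 object keys under which one log file is stored.

    Hypothetical layout: the first key groups by date (internal reporting),
    the second groups by customer (per-tenant custom queries).
    """
    day = event_time.strftime("%Y/%m/%d")
    by_date = f"by-date/{day}/{customer_id}/{filename}"
    by_customer = f"by-customer/{customer_id}/{day}/{filename}"
    return by_date, by_customer


if __name__ == "__main__":
    when = datetime(2016, 3, 5, tzinfo=timezone.utc)
    for key in log_keys("acme", when, "events.log.gz"):
        print(key)
```

With prefixes like these, a job can be pointed at a single day across all customers or at a single customer's full history, without any filtering step.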
Ok, I couldn’t resist the title, and also too, Billy Collins’ line about “No cookie nibbled by a French novelist could send one into the past more suddenly” is just awesome so why not… But I digress…
At OpenDSP, we, obviously, cookie (yes, it is a verb — to cookie) users. While more details can be had in other posts (such as Nanoput, and more under DMP tag), here I’ll just address the format issue.
The cookies we use are known as mod_uid cookies because of the Apache project. (A good writeup is done here). Because we use NGINX, and, consequently, use its uid_set/uid_get functionality, one may be puzzled when reconciling the cookie as seen sent to the user and what NGINX writes in its logs. So here
Give engineers the same problem, and they will all come up with roughly the same solution. But I am proud to actually have invented a different one.
Many systems, such as ad servers, RTB bidders, DSPs, etc., store the business rules (targeting, pacing, etc.) information that a user manipulates via UI in some sort of database (likely RDBMS), and then probably transform it into some sort of fast lookup for the real-time decision making. Seems straightforward.
At OpenDSP we went for a new approach which I believe is quite powerful. Simply put, we compile the rules from DB into a Groovy script (well, a set of Groovy scripts). Given that Groovy compiles to JVM bytecode, essentially we create concrete subclasses conforming to the following interfaces and using the following abstract classes:
- For ads, that is, targeting, implementing Ad interface and subclassing AdImpl
- For creatives, implementing Tag interface and subclassing TagImpl
- Finally, there is BidPriceCalculator interface, which will be described below. This is not auto-generated.
A script that compiles the DB rules into the scripts is triggered by the UI whenever anything is saved, and the Lot49 process will reload any changed scripts every so often (e.g., 5 minutes).
Generally, the algorithm is as follows: During a request, the system will take the OpenRTB bid request and augment it with various data (e.g., geo lookup, information on user, etc). It will then, for each Ad, call its canBid1() and checkSegments() methods (both must return true). canBid1() is called first, to filter out quickly those ads that won’t bid based on the data already available in the request, before we augment it with data that needs to be fetched from a user cache; after that, checkSegments() will be called on the remaining candidates. In turn, canBid1() will call the canBid() methods of all Tags, to check which, if any, creatives fit this bid request (based on media type, size, video duration if applicable, etc.)
Finally, the price is determined based on pacing and budget settings, unless the appropriate BidPriceCalculator is found for an Ad, in which case it is used to get the bid price.
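The two-phase filtering described above can be sketched as follows. The real classes are Groovy/JVM code; this Python version only mirrors the control flow, and the field names (`country`, `w`, `h`, `user_segments`), the `default_price` stand-in for the pacing- and budget-based price, and the `fetch_user_data` callback are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tag:
    width: int
    height: int

    def can_bid(self, request: dict) -> bool:
        # A creative fits if its size matches the requested slot.
        return (request.get("w"), request.get("h")) == (self.width, self.height)


@dataclass
class Ad:
    ad_id: str
    countries: set
    segments: set
    tags: list
    default_price: float  # stand-in for the pacing/budget-based price
    price_calculator: Optional[Callable[[dict], float]] = None

    def can_bid1(self, request: dict) -> bool:
        # Cheap checks that use only data already present in the request.
        if request.get("country") not in self.countries:
            return False
        return any(tag.can_bid(request) for tag in self.tags)

    def check_segments(self, request: dict) -> bool:
        # Requires user data fetched from the user cache.
        return bool(self.segments & set(request.get("user_segments", ())))


def select_bids(ads, request, fetch_user_data):
    """Return (ad_id, price) pairs for the ads that pass both phases."""
    candidates = [ad for ad in ads if ad.can_bid1(request)]
    if not candidates:
        return []  # nothing survived phase one: skip the user-cache fetch
    augmented = {**request, **fetch_user_data(request)}
    bids = []
    for ad in candidates:
        if ad.check_segments(augmented):
            calc = ad.price_calculator
            bids.append((ad.ad_id, calc(augmented) if calc else ad.default_price))
    return bids
```

The point of splitting the check in two is visible in `select_bids`: the user-cache lookup happens at most once per request, and only when at least one ad has passed the cheap checks.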
The resulting scripts are essentially a bunch of getters based on the rules in the DB, and the abstract classes implementing the interfaces are just a bunch of if/then statements using those getters. So, what is the advantage of this rather than just reading rules from the DB?
The advantage is that any part of the generated script can be modified manually for custom targeting and/or bid pricing based on some model. These scripts may be edited by data scientists independently, eliminating the need for engineers to translate data scientists’ models into code! All the data that the model needs will be provided in the arguments to the appropriate methods; just implement the interface in Groovy, and you’re done!
It is important that all data is provided in the input, so the scripts do not need to concern themselves with high-latency fetching of data from somewhere. It is also safe, even if the data scientist makes a mistake. Consider: because we run in JVM, we can take advantage of the Java security model and create our own SandboxSecurityManager to prohibit network calls, harmful calls such as System.exit(), only allow it to call helpers from certain packages, etc.
One caveat, however: the security model is not much help against other harmful things such as recursion or infinite loops. The idea is to solve those as follows (about which we’ll talk in a later post):
- Recursion: by using static analysis on the loaded code
- Loops: either by doing the same and prohibiting loops completely, or, if that is undesirable, by observing which scripts run longer than some time and blacklisting those that exceed the threshold.
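The "observe and blacklist" option for loops could look roughly like this. It is a sketch under assumptions: the class name, the strike counter, and the injectable clock are mine, and the timing is measured after the call returns, so it catches scripts that are too slow but cannot interrupt one that never finishes.

```python
import time


class ScriptGuard:
    """Runs named scripts and blacklists those that exceed a time budget."""

    def __init__(self, budget_seconds, max_strikes=1, clock=time.perf_counter):
        self.budget = budget_seconds
        self.max_strikes = max_strikes
        self.clock = clock      # injectable so tests can use a fake clock
        self.strikes = {}       # script name -> number of slow runs seen
        self.blacklist = set()

    def run(self, name, func, *args):
        """Run func unless blacklisted; record a strike if it ran too long."""
        if name in self.blacklist:
            return None  # blacklisted scripts are skipped entirely
        start = self.clock()
        result = func(*args)
        elapsed = self.clock() - start
        if elapsed > self.budget:
            self.strikes[name] = self.strikes.get(name, 0) + 1
            if self.strikes[name] >= self.max_strikes:
                self.blacklist.add(name)
        return result


if __name__ == "__main__":
    guard = ScriptGuard(budget_seconds=0.005)
    guard.run("slow_model", time.sleep, 0.02)
    print(sorted(guard.blacklist))
```

Allowing more than one strike (`max_strikes`) is a design choice that tolerates a one-off pause, such as a garbage collection, before a script is taken out of rotation.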
Let’s look at some examples:
- Here is an Ad targeting US and one of the segments, “386:fp:236” or “387:fp:236”.
Quick note: “fp” here indicates it is a first-party segment; for more information on how these are created and managed securely in a multi-tenant environment, see Nanoput and other articles on our DMP elsewhere on this blog.
- A creative for this ad is a 300×250 banner. Notice the convention of including the ID of the Ad in the name of the Tag. This is for demonstration purposes, and thus a lot of code is unrolled here; in reality, methods like getClickRedir() or getTagTemplate() would simply be inherited from the TagImpl class.
- Finally, BidPriceCalculator (you can see it is defined in the Ad). As can be seen, it uses a formula based on a segment score and a viewability score from Integral to come up with a price. (This, of course, is just an example.) Notice that the information about the user’s score, which is part of the DS-built model, and Viewability score are part of the augmented request that is the argument to getBidPrice() method. In other words, as noted above, the code here needs to just execute a formula, and all the data will be provided to it; this allows us to sandbox this code for safety while allowing for flexibility.
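To illustrate the shape of such a calculator (not the actual formula, and in Python rather than Groovy), here is a sketch of a pure function over the augmented request. The field names `user_score` and `viewability`, the base CPM, and the cap are all assumptions.

```python
def get_bid_price(request, base_cpm=2.0, cap=10.0):
    """Hypothetical pricing formula: scale a base CPM by two scores.

    Both scores are expected in the 0..1 range and are read from the
    already-augmented request; nothing is fetched from inside the function.
    """
    user_score = request.get("user_score", 0.0)    # from the data science model
    viewability = request.get("viewability", 0.0)  # from the viewability vendor
    price = base_cpm * user_score * viewability
    # Clamp to [0, cap] so a bad input cannot produce a runaway bid.
    return max(0.0, min(price, cap))


if __name__ == "__main__":
    print(get_bid_price({"user_score": 0.5, "viewability": 0.5}))
```

Because the function only reads its argument and does arithmetic, it needs no network or file access, which is exactly what makes it a good fit for a restrictive sandbox.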
Pretty cool, if I may say so myself.
P.S. You may feel free to contrast this approach with Xandr’s Bonsai.
Music of the post: Feeling groovy
Every marketer, it seems, wants to participate in real-time bidding (RTB). But what is it that they really want?
They want an ability to price (price, not target!) a particular impression in real-time. Based on the secret-sauce business logic and data science. Fair enough.
But that secret sauce, part of their core competence, is just the tip of the iceberg — and the submerged part is all that is required to keep that tip above water. To wit:
- Designing, developing, testing and maintaining actual code for the UI for targeting, the bidder for bidding, reporting, and data management
- Scaling and deploying such code in some infrastructure (own data center, clouds like AWS, GCE, Azure), etc.
- Integrating with all exchanges of interest, including the following steps:
- Code: passing functional tests (understanding the exchange’s requirements for parsing request and sending response)
- Infrastructure: ensuring the response is being sent to the exchange within the double-digit-millisecond limit
- Scaling: As above, but under real load (hundreds of thousands of queries per second)
- Business: Paperwork to ensure seat on the exchange, including credit agreements when necessary
- Operations: Ongoing monitoring of the operations, including technical (increased latency) and business (low fill level, high disapproval level) concerns, whether these concerns are raised by clients or exchange partners or, ideally, proactively addressed internally.
None of which is their core competence. We propose to address the underwater part. It’ll be exciting.
Enter OpenDSP. We got something cool coming up here. Stay tuned.