Validation?

February 4, 2023March 20, 2023 Gregory Golberg opendsp ad tech, ad-tech, adtech, dsl, groovy, jvm, mediamath, opendsp, rtb

I recently learned about MediaMath’s custom brain. Excited for the validation of OpenDSP‘s concept of custom bidding logic. But it is a bit limited, being just a polynomial. MediaMath’s custom bid router provides way more flexibility — but you need your own infrastructure! So — I still think our approach — DSL-based scripting — is better, because it combines both!

Acquired!

October 2, 2017January 26, 2018 Gregory Golberg opendsp acquisition, ad tech, exit, opendsp, srax

SRAX Acquires OpenDSP’s Demand-Side Platform

OpenDSP’s DMP: Metadata-driven data collection

August 28, 2015January 19, 2020 Gregory Golberg opendsp data, dmp, dsp, metadata, nanoput, opendsp, rtb, sql

Here I will discuss metadata-driven data collection platform.

OpenDSP’s DMP: Nanoput

July 8, 2015January 30, 2020 Gregory Golberg opendsp adtech, dmp, dsp, lua, nginx, openresty, openrtb, rtb

Here I will describe the Nanoput project, which comprises a large part of OpenDSP’s DMP (Data Management Platform). There are, of course, other pieces — the entire picture will be painted under the DMP tag.

Overall DMP stack is as follows:

Nanoput proper: NGINX with Lua to handle . I’ll confess that at the moment this is a bit of a pet (as in not cattle). We are considering using OpenResty instead of rolling our own, which uses parts of OpenResty. But no matter, here I will present some features that can be achieved with this setup — and one instance is capable of handling all this.
Redis for storing and manipulating user sets — ZSET is great
MySQL for storing metadata — will be described in a separate post
PHP/JS for a simple Web interface to define the said metadata
Python for translating metadata into the configuration for NGINX
AWS S3 for storing raw logs — pre-partitioned so that EMR can be used easily.

Conceptual introduction

Conceptually, let’s consider the idea of an “event”. An impression, a conversion, a video tracking event, a site visit, etc, is an event — anything that fires a request to our Nanoput is. You may recognize a similarity with Snowplow — and that is because we are solving a similar problem.

To proceed further, I must ask of a little indulgence — please bear with me. As a technologist, it really grated me to hear the words like “Javascript pixel” but I have learned to stop worrying and love the bomb. Therefore, Javascript pixel it is when it is a JS code that fires some GET URLs. Now, then, the requests are assumed to be fired as GET HTTP requests (do not have to be but that’s the primary idea behind Nanoput per se — management of metadata to ingest things via other means, like FTP upload, etc, will be addressed separately) and can originate, for example, from:

Exchanges or DMPs, as an exchange-initiated cookie-sync: see below.
Regular user behavior — impressions, in case of video, video-tracking events; conversions
Calls as if initiated by exchanges or DMPs to Nanoput but in reality are, heh, Javascript pixels

Now, let us also consider the idea of a “user segment”. If you think about it, a segment is just a set of users. Thus, we may as well consider a user that produced a certain event as belonging to some segment. These may be explicitly asked for, such as “users we want to retarget“, or “users who converted”, etc. But there is no reason why any event cannot be defined as a segment-defining one.

Segments, here, are a special case of data collections concept discussed in a different post.

Given that, we can now dive into Nanoput implementation

General data acquisition idea

Here, we simply leverage basic NGINX functionality, that is logging. To that end, we split the main config file into sections that we include that deal with log format and location and behavior.

Static data acquisition URLs

By “static”, here we mean common use cases that are just part of Nanoput (hence the “man” subdirectory you will notice in examples — stands for, just like the Unix man command, for “manual”). Here we have:

Site events (essentially, those are an extension of retargeting concept).
Standard event tracking — by which we mean, standard events that happen in Ad world.

Notice that we also augment information available from NGINX (HTTP headers, etc.) with geo data using GeoIP module and user-agent/device/OS

Dynamic (metadata-driven) data acquisition URLs

Dynamic data acquisition works simply: a process reads the metadata table and creates appropriate entries in the NGINX configs that define log format and location and behavior.

Creating “segments”

On every “event”, consider script. We use awesome Redis’s Sorted Set functionality here inserting things twice. The key idea here, again, is a variation on dealing with data gravity concerns by just duplicating storage. We create two sorted sets for each key, the “score” being the first and last time we have seen the user. The reasoning for this is that:

First-seen: we can write batch scripts to get rid of users we have last seen over X days ago (expiring targeting).
Last-seen: helps us with conversion attribution (yes, this assumes naive last-click attribution or variants.

Duplication is not just for every user — it is for every set. The key here is the set (or segment) name, and the value is the set of users.

An added benefit of this is that new segments can be created by using various Redis set operations (union, intersection) easily.

Some useful shortcuts for a DMP

Getting OS/browser info without necessarily using WURFL (though that can easily be fronted by NGINX too, actually).

Exchange cookie sync

In the display world, there is a need for cookie syncing between DSP and a third-party DMP or an exchange/SSP, and that can be either exchange, DMP or DSP-initiated, or both. Some exchanges may allow the redirect chain to proceed further, some may not. Nanoput provides this functionality for exchanges we deal with as well as a template for doing it for other partners — at the speed that NGINX provides. Here are the moving parts:

On a partner (SSP/exchange or DMP)-initiated sync, a /man/xchin URL calls xchin.lua script that, depending on known partner policy, responds with a proper redirect (coded specifically to ensure maximum redirects). You may notice that this could be initiated by the DSP itself and you would not be wrong.
DSP itself may initiate a cookie sync, and this is controlled by xchout.lua script.

Storing for further analysis

Raw logs, formatted as above, are uploaded to S3. Notice that they are stored twice, with different partitioning schema. This is one of the key ideas in Nanoput — storage is cheap; duplicating the storage this way and then using one or another partitioning schema depending on the use case:

Partitioned by date — useful for most internally-provided reporting
Partitioned by user — here, it’s important to see that it’s a multi-tenant system; in this context, “user” is a client/customer. This partitioning is useful to provide customers with ability to run their own custom queries. See also notes on data duplication and data gravity.

Thinking outside the box — and in a sandbox

March 13, 2015November 29, 2022 Gregory Golberg opendsp groovy, java, java security model, jvm, opendsp, rtb, sandbox

Give engineers the same problem, they all will come with roughly the same solution. But I am proud to actually have invented a different one.

Many systems, such as ad servers, RTB bidders, DSPs, etc., store the business rules (targeting, pacing, etc.) information that a user manipulates via UI in some sort of database (likely RDBMS), and then probably transform it into some sort of fast lookup for the real-time decision making. Seems straightforward.

At OpenDSP we went for a new approach which I believe is quite powerful. Simply put, we compile the rules from DB into a Groovy script (well, a set of Groovy scripts). Given that Groovy compiles to JVM bytecode, essentially we create concrete subclasses conforming to the following interfaces and using the following abstract classes:

For ads, that is, targeting, implementing Ad interface and subclassing AdImpl
For creatives, implementing Tag interface and subclassing TagImpl
Finally, there is BidPriceCalculator interface, which will be described below. This is not auto-generated.

A script that compiles the DB rules into the scripts is triggered by the UI whenever anything is saved, and the Lot49 process will reload any changed scripts every so often (e.g., 5 minutes).

Generally, the algorithm is as follows: During a request, the system will take the OpenRTB bid request and augment it with various data (e.g., geo lookup, information on user, etc). It will then, for each Ad, call its canBid1() and checkSegments() methods (both must return true). canBid1() is called first, to filter out quickly those ads that won’t bid based on the data already available in the request, before we augment it with data that needs to be fetched from a user cache; after that, checkSegments() will be called on the remaining candidates. In turn, canBid1() will call the canBid() methods of all Tags, to check which, if any, creatives fit this bid request (based on media type, size, video duration if applicable, etc.)

Finally, the price is determined based on pacing and budget settings, unless the appropriate BidPriceCalculator is found for an Ad, in which case it is used to get the bid price.

The resulting scripts are essentially a bunch of getters based on the rules in the DB, and the abstract classes implementing the interfaces are just a bunch of if/then statements using those getters. So, what is the advantage of this rather than just reading rules from the DB?

The advantage is that any part of the generated script can be modified manually for custom targeting and/or bid pricing based on some model. These scripts may be edited by data scientists independently, eliminating the need for engineers to translate data scientists’ models into code! All the data that the model needs will be provided in the arguments to the appropriate methods; just implement the interface in Groovy, and you’re done!

It is important that all data is provided in the input, so the scripts do not need to concern themselves with high-latency fetching of data from somewhere. It is also safe, even if the data scientist makes a mistake. Consider: because we run in JVM, we can take advantage of the Java security model and create our own SandboxSecurityManager to prohibit network calls, harmful calls such as System.exit(), only allow it to call helpers from certain packages, etc.

One caveat, however: security model is not much help for other harmful things such as recursion or infinite loops. The idea, however, is to solve those as follows (about which we’ll talk in a later post):

Recursion: by using static analysis on the loaded code
Loops: either by doing the same and prohibiting loops completely, or, if undesirable, by observing which scripts run longer than some time and blacklisting those that exceed the threshold.

Let’s look at some examples:

Here is an Ad targeting US and one of the segments, “386:fp:236” or “387:fp:236”.

Quick note: “fp” here indicates it is a first-party segment; for more information on how these are created and managed securely in a multi-tenant environment, see Nanoput and other articles on our DMP elsewhere on this blog.
A creative for this ad is a 300×250 banner. Notice the convention of how the ID of the Ad is included in the name of the Tag. This is for demonstration purposes, and thus a lot of code is unrolled here; in reality, methods like getClickRedir() or getTagTemplate() would just be used from the TagImpl class

Finally, BidPriceCalculator (you can see it is defined in the Ad). As can be seen, it uses a formula based on a segment score and a viewability score from Integral to come up with a price. (This, of course, is just an example.) Notice that the information about the user’s score, which is part of the DS-built model, and Viewability score are part of the augmented request that is the argument to getBidPrice() method. In other words, as noted above, the code here needs to just execute a formula, and all the data will be provided to it; this allows us to sandbox this code for safety while allowing for flexibility.

Pretty cool, if I may say so myself.

P.S. You may feel free to contrast this approach with Xandr’s Bonsai.

Music of the post: Feeling groovy

OpenDSP – stay tuned

April 7, 2014December 9, 2017 Gregory Golberg opendsp ad tech, dsp, multi-tenancy, opencmo, opendsp, rtb, saas

Every marketer, it seems, wants to participate in real-time bidding (RTB). But what is it that they really want?

They want an ability to price (price, not target!) a particular impression in real-time. Based on the secret-sauce business logic and data science. Fair enough.

But that secret sauce, part of their core competence, is just the tip of the iceberg — and the submerged part is all that is required to keep that tip above water. To wit:

Designing, developing, testing and maintaining actual code for the UI for targeting, the bidder for bidding, reporting, and data management
Scaling and deploying such code in some infrastructure (own data center,
clouds like AWS, GCE, Azure), etc.
Integrating with all exchanges of interest, including the following steps:
- Code: passing functional tests (understanding the exchange’s requirements for parsing request and sending response)
- Infrastructure: ensuring the response is being sent to the exchange within the double-digit-millisecond limit
- Scaling: As above, but under real load (hundreds of thousands of queries per second)
- Business: Paperwork to ensure seat on the exchange, including credit agreements when necessary
Operations: Ongoing monitoring of the operations, including technical (increased latency) and business (low fill level, high disapproval level) concerns (whether these concerns are triggered by clients, exchange partners or,
ideally, pro-actively addressed internally.

None of which is their core competence. We propose to address the underwater part. It’ll be exciting.

Enter OpenDSP. We got something cool coming up here. Stay tuned.

Watch this space, and by this space I mean this blog, this GitHub account.

DEBEDb

Fear, loathing, uncertainty, doubt, laziness, impatience, Oxford commas, and hubris

opendsp