Some more random notes

  • Completely agree with @mipsytipsy here:

    I am an extremely literal person, and literally speaking, nobody can be a “full stack” engineer. The idea is a ridiculous one. There’s too much stack! But that’s not what people mean when they say it. They mean, “I’m not just a frontend or backend engineer. I span boundaries.”
  • Yeah, this blog is for bragging.
  • What is it with fillable PDFs on some gov’t websites (I know, I know; that’s a post for a different day) — why can they sometimes be saved but not printed?
  • TFW, about 16 years after your colleague wrote an impassioned call to “Tear down that GIL!” (take that, Mr. Gorbachev!), the GIL is finally torn down.
  • I was wondering what the Go team was smoking when they came up with the reference date concept — and can I have some of that?
  • What is it that causes Medium to suck so much? Is it all the useless “content creators” who, pre-GPT, just rephrased stuff from the Internet with nary a value added (“Here’s 10 reasons to learn Python” — and here’s how to write “Hello, world” in C, did you know?)? Is it that now probably thousands more are using generative AI — kinda indistinguishable? Or is it their idiotic subscription model, which cannot deal with some logins? (I should really devote time to figuring that one out, but why — is that platform really worth anything at all?)

Adventures with Golang dependency injection

Just some notes as I am learning this… There aren’t good answers here, mostly questions (am I doing it right?). All these examples are part of a (not very well organized) GitHub repo here.

Structure injection

Having once hated the magic of Spring’s DI, I’ve grown cautiously accustomed to the whole @Autowired business. When it comes to Go, I’ve come across Uber’s Fx framework, which looks great, but I haven’t been able to figure out just how to automagically inject fields whose values are being Provided into other structs.

An attempt to ask our overlords yielded something not very clear.

I finally broke down and asked a stupid question. Then I found the answer — do not use constructors; just use fx.In in combination with fx.Populate(). Finally, this works. But it doesn’t seem ideal in all cases…
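For reference, here is a minimal sketch of that pattern (a toy of my own, not from the repo): a value is Provided, an fx.In struct declares it as a field, and fx.Populate() fills in a package-level variable:

package main

import (
	"fmt"

	"go.uber.org/fx"
)

type Greeting string

func NewGreeting() Greeting { return "hello" }

// Deps embeds fx.In, so its exported fields are injected by type.
type Deps struct {
	fx.In

	Greeting Greeting
}

var deps Deps

func main() {
	app := fx.New(
		fx.Provide(NewGreeting),
		fx.Populate(&deps), // fills deps.Greeting from the container
	)
	if err := app.Err(); err != nil {
		panic(err)
	}
	fmt.Println(deps.Greeting) // "hello"
}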

Avoiding boilerplate duplication

This is all well and good, but not in every case. For example, consider this addition to the above:

package dependencies

import "go.uber.org/fx"

type Foo string

type Bar string

type Baz string

type DependenciesType struct {
	fx.In

	Foo Foo
	Bar Bar
	Baz Baz
}

func NewFoo() Foo {
	return "foo"
}

func NewBar() Bar {
	return "bar"
}

func NewBaz() Baz {
	return "baz"
}

var Dependencies DependenciesType

var DependenciesModule = fx.Options(
	fx.Provide(NewFoo),
	fx.Provide(NewBar),
	fx.Provide(NewBaz),
)

Used as dependencies.Dependencies, this works fine (as above). But what if I want to get rid of this var and use constructors instead? I don’t like the proliferation of parameters in constructors. I could use parameter objects, but I’d like to avoid the boilerplate of copying fields from the parameter object into the struct being returned — so I use reflection, like so (generics are nice):

package utils

import "reflect"

// Construct copies same-named, same-typed fields from params into a new T.
func Construct[P any, T any, PT interface{ *T }](params P) PT {
	p := PT(new(T))
	construct0(params, p)
	return p
}

func construct0(params interface{}, retval interface{}) {
	// Check if retval is a pointer
	rv := reflect.ValueOf(retval)
	if rv.Kind() != reflect.Ptr {
		panic("retval is not a pointer")
	}

	// Dereference the pointer to get the underlying value
	rv = rv.Elem()

	// Check if the dereferenced value is a struct
	if rv.Kind() != reflect.Struct {
		panic("retval is not a pointer to a struct")
	}

	// Now, get the value of params
	rp := reflect.ValueOf(params)
	if rp.Kind() != reflect.Struct {
		panic("params is not a struct")
	}

	// Iterate over the fields of params and copy to retval
	for i := 0; i < rp.NumField(); i++ {
		name := rp.Type().Field(i).Name
		field, ok := rv.Type().FieldByName(name)
		if ok && field.Type == rp.Field(i).Type() {
			rv.FieldByName(name).Set(rp.Field(i))
		}
	}

}

So then I can use it as follows:

package dependencies

import (
	"example.com/fxtest/utils"
	"go.uber.org/fx"
)

type Foo *string

type Bar *string

type Baz *string

type DependenciesType struct {
	Foo Foo
	Bar Bar
	Baz Baz
}

type DependenciesParams struct {
	fx.In
	Foo Foo
	Bar Bar
	Baz Baz
}

func NewFoo() Foo {
	s := "foo"
	return &s
}

func NewBar() Bar {
	s := "bar"
	return &s
}

func NewBaz() Baz {
	s := "foo"
	return &s
}

func NewDependencies(params DependenciesParams) *DependenciesType {
	retval := utils.Construct[DependenciesParams, DependenciesType](params)
	return retval
}

var DependenciesModule = fx.Module("dependencies",
	fx.Provide(NewFoo),
	fx.Provide(NewBar),
	fx.Provide(NewBaz),

	fx.Provide(NewDependencies),
)

But while this takes care of proliferating parameters in the constructor as well as the boilerplate step of copying, I still cannot avoid duplicating the fields between DependenciesType and DependenciesParams, running into various problems.

Looks like this is still TBD on the library side; I’ll see if I can get further.

Conditional Provide

When using constructors, I would have a construct such as:

type X struct {
   field *FieldType
}

func NewX() *X {
   x := &X{}
   if os.Getenv("FOO") == "BAR" {
     x.field = NewFieldType(...)
   }
   return x
}

In other words, I wanted field to only be initialized if some environment variable is set. In transitioning from using constructors to fx.Provide(), I wanted to keep the same functionality, so I came up with this:

type XType struct {
   fx.In

   Field *FieldType `optional:"true"`
}

var X XType

var XModule = fx.Module("x",
	func() fx.Option {
		if os.Getenv("FOO") == "BAR" {
			return fx.Options(
				fx.Provide(NewFieldType),
			)
		}
		return fx.Options()
	}(),
	fx.Populate(&X),
)

Works fine. But is it the right way?


Putting on my marketing hat: Random MailJet hack

(Yes, I do indeed wear multiple hats — marketing FTW, or is it WTF?)

Really wanted to use MailJet (BTW, guys, what’s with support, and with not being able to edit a campaign after launch? Do I get what I pay for?) to send users a list of items, dynamically. For example, say I have the following list of items “forgotten” in a shopping cart:

User,Items
Alice,"bat, ball"
Bob,"racquet, shuttlecock"


And I want to send something like (notice I’d also like links there):

Hey Alice, did you forget this in your cart:

Ball
Bat

Turns out, the loop construct doesn’t work. Aside: this is despite ChatGPT’s valiant attempt to suggest that something like this could work:

{{
{% set items = data.items|split(',') %}
{% for item in items %}
{{ item.strip() }}
{% endfor %}
}}


The answer is hacky, but it works. If you remember that SGML is OK with single quotes, you can construct your contacts list like so:

User,Items

Alice,"<a href='https://www.amazon.com/BB-W-Wooden-baseball-bat-size/dp/B0039NKEZQ/'>bat</a>,<a href='https://www.amazon.com/Rawlings-Official-Recreational-Baseballs-OLB3BBOX3/dp/B00AWVNPMM/'>ball</a>
Bob,"<a href='https://www.amazon.com/YONEX-Graphite-Badminton-Racquet-Tension/dp/B08X2SXQHR/'>racquet</a>, <a href='https://www.amazon.com/White-Badminton-Birdies-Bedminton-Shuttlecocks/dp/B0B9FPRHBF'>shuttlecock</a>"




And make the HTML be just

Hey [[data:user:""]],

Did you forget this in your cart?
[[data:items:""]]


Works!

P.S. Links provided here are just whatever I found on Google, no affiliate marketing.

Fun with ChatGPT

So prompt engineering is all the rage (a friend of mine even wrote a book — you magnificent bastard, I haven’t read it yet) — and I have a problem: trying to export GCP configuration for use with Deployment Manager. Specifically, how to get Cloud Run to work.

So, an impressive session (and another one) results in some cool reasoning by ChatGPT in response to me pointing out that things don’t quite work (yes, Amigo Grady, I know it’s not reasoning)… Except that they don’t work.

Which I finally found via StackOverflow — because this feature is unsupported, per Google. But this info is probably outside the training data window of our esteemed LLM.

What’s the moral of the story? There is no moral — use it as a tool. Just a tool.

More random notes

  • Yeah, no-code/low-code is great (wave, OpenAI). Especially for growth-hacking, right (hello, Butcher)? But here’s your no-code platform — Google Ads. Gawd… I’d rather write code.
  • Why does FastAPILoggingHandler seem to ignore my formatter? I don’t know; but the fact that someone else also spends time figuring out inane things that should just work is quite frustrating.
  • How many yaks have you shaved today?
  • O, GCP, how convenient: in the env var YAML sent to gcloud run, you helpfully interpret things like “on”/“off”, “true”/“false”, “yes”/“no” as booleans, eh? And then you crash with:

    ERROR: gcloud crashed (ValidationError): Expected type <class 'str'> for field value, found True (type <class 'bool'>)

    Because of course you do.
  • “Overriding a number of default settings is key to shaving off unnecessary spend”. Yep.

Validation?

I recently learned about MediaMath’s custom brain. Excited for the validation of OpenDSP’s concept of custom bidding logic. But it is a bit limited, being just a polynomial. MediaMath’s custom bid router provides way more flexibility — but you need your own infrastructure! So I still think our approach — DSL-based scripting — is better, because it combines both!

Random notes for January 2023

Not enough for any singular entry, but enough to write a bunch of annoyed points. Because I hate Twitter threads and this is the reverse: unconnected entries jammed together.

  • GoLang: Looks like the answers to my questions are nicely written up here.
  • Technology and Society: Ok, I promise I’ll get to geekery here. So, PeopleCDC folks seem upset about the New Yorker article. But I am surprised — and maybe it is an oversight — at the lack of inclusion of IT people in the form. Artists, yeah, to carry the message — but if the goal is to slow the spread, why is no consideration given to automating various things (look at how pathetic most government websites are for things that are routine)?

    Not expecting to hear back, really.
  • Google Ads and API Management: Every time you think you’ve gotten used to all the various entities in Google Ads, you realize there’s, of course, a sunsetting of UA… Of course! Of course this is where I pause and let Steve Yegge on with his rant:

    Dear RECIPIENT,

    Fuck yooooouuuuuuuu. Fuck you, fuck you, Fuck You. Drop whatever you are doing because it’s not important. What is important is OUR time. It’s costing us time and money to support our shit, and we’re tired of it, so we’re not going to support it anymore. So drop your fucking plans and go start digging through our shitty documentation, begging for scraps on forums, and oh by the way, our new shit is COMPLETELY different from the old shit, because well, we fucked that design up pretty bad, heh, but hey, that’s YOUR problem, not our problem.

    We remain committed as always to ensuring everything you write will be unusable within 1 year.

  • API Management, ListHub: First, I learned there’s a standards body. Unsure what I’m making of it (I mean, I suppose I’ve gotten good results from IAB, and standardization of FinOps is somewhat ongoing, so, er, maybe not all bureaucracy is awful, horrible crap).

    But that’s kind of a side note.

Refreshing Golang

Goroutines — I feel dumb

Doing a Golang refresher. I realize I still do not understand how exactly a new thread is spun up when a syscall happens, or what happens to M and P when we are waiting on a channel. What does it mean that “Every M must be able to execute any runnable G” — that is, what does the word “execute” mean here? This document says so again below: “When an M is willing to start executing Go code, it must pop a P from the list. When an M ends executing Go code, it pushes the P to the list.” What is “ends executing”?

Similarly here, what does it mean “M will skip the G”? How does it “skip it” if thread M is running G’s instructions now? Doesn’t it block with the blocking G? What am I missing?

OK, so let’s say in case of I/O, it’s due to netpoller magic:

Whenever you open or accept a connection in Go, the file descriptor that backs it is set to non-blocking mode. This means that if you try to do I/O on it and the file descriptor isn’t ready, it will return an error code saying so. Whenever a goroutine tries to read or write to a connection, the networking code will do the operation until it receives such an error, then call into the netpoller, telling it to notify the goroutine when it is ready to perform I/O again. The goroutine is then scheduled out of the thread it’s running on and another goroutine is run in its place.

When the netpoller receives notification from the OS that it can perform I/O on a file descriptor, it will look through its internal data structure, see if there are any goroutines that are blocked on that file and notify them if there are any. The goroutine can then retry the I/O operation that caused it to block and succeed in doing so.

But still unclear — is it the netpoller itself that schedules G out of M? How does M stop running G and start running some other G’? And what about blocking on a channel operation?

Per this post, this “scheduling out” is done by runtime:

When M executes a certain G, if a syscall or other blocking operations occur, M will block. If there are some Gs currently executing, the runtime will remove the thread M from P, and then create a new thread.

In Go: Goroutine, OS Thread and CPU Management, Vincent describes this as

Go optimizes the system calls — whatever it is blocking or not — by wrapping them up in the runtime. This wrapper will automatically dissociate the P from the thread M and allow another thread to run on it.

The “wrapper” seems to be the netpoller (see above). OK, I suppose all of this connects, in a somewhat handwavy way, enough that I’m almost satisfied. I feel like it’s still just a couple of unconnected dots, though… I suppose we can stipulate syscalls, but how are channel blocks handled? Is the same mechanism used, but via channels rather than the netpoller in that case?
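One small experiment that at least convinced me channel blocking is cheap thread-wise: park thousands of Gs on a channel and count OS thread creations via pprof (a rough proxy for threads in use, but telling — expect a handful, not 10,000):

package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	ch := make(chan struct{})
	for i := 0; i < 10000; i++ {
		go func() { <-ch }() // each G parks in the channel's wait queue
	}
	time.Sleep(time.Second) // let the scheduler settle

	fmt.Println("goroutines:", runtime.NumGoroutine())
	// Thread-creation events are a rough proxy for OS threads;
	// parked Gs hold no M, so this stays small.
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
	close(ch)
}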

Some good deeper-than-usual resources

Some refresher examples for myself

As I was refreshing my Golang, I made a bunch of snippets as a kind of study cards. Nothing sophisticated here, just the basics. Yeah, it’s like Go By Example, but writing them myself is better for remembering. (Kinda like lecture notes — except, well, you can’t do these with pen and paper…)

Patterns: descriptivism vs prescriptivism

This is going to be so short, it requires this sentence to say so, so it appears a bit longer.

It seems that there are two ways of looking at them. Prescriptive: “When faced with a problem of class X, use pattern A”. Or descriptive, “When faced with a problem of class X, a lot of times engineers use approaches Alpha, Beta, Gamma that have a particular pattern in common; let’s extract it and call it A so we have a common terminology.”

The “prescriptive” part really should be a “strong suggestion”, given weight by the fact that the pattern is widespread enough to get a name, but nothing beyond that. (See also “Thinking outside the box“.)

What prompted this? Well, TIL that exercises such as Ad hoc querying on AWS have a name: “lakehouse“, and that I’ve apparently been thinking about how best to do “Reverse ETL” without thinking “Reverse ETL”. Well, I guess that’s open source marketing.

This post is not making any prescriptions.

Today’s yak shaving

So I wanted to catch up on Go. I like my IDEs, so I got my Go 1.19.2, my GoClipse 0.16.1, and off to remember how to ride this bicycle again. Or rather ride this yak, because…

I like IDEs mostly for two things: interactive, graphical debugging (for that, see the P.S. section) and content assist. And with content assist, we have a problem.
The solution seems to be to install a fork of gocode. Ok, I got content assist now. On to the next sine qua non of an IDE, the interactive debugger. Wait… There’s no GDB on M1? Yikes.

For the debugger, sure, I could use LLDB, as suggested, or, even better, Delve — but I want it in the IDE (how many times do I have to say that, in 2022?). A few GUI frontends exist, but I am partial to the Eclipse platform (I can be as picky with my tools as any tradesman).

Is it time to dust off Kabuta, an abandoned attempt to adapt Delve API to GDB/MI interface that GoClipse speaks? Ok why not, let’s go get that sucker… Done.

But how do I get it into GoClipse? A StackOverflow post suggests it is done via the “New Project wizard” — but not so fast, Mr. Yak Jockey.
Ok, let’s hack around it, but first let’s file an issue on GoClipse… Wait, say it ain’t so… GoClipse is no longer maintained? My world collapses. To Bruno’s point that:

I’ve seen the issues Eclipse has, both external (functionality, UI design, bugs), and internal (code debt, shortage of manpower). I was holding on to the hope it would recover, but now I don’t see much of a future for Eclipse other than a legacy existence, and I realize exiting was long overdue.

Just a few months ago, I got fed up with how much of a torture it is to step through code in IDEA (and no waving of dead chickens seems to help) and went back to Eclipse for Java.
Do I fork GoClipse?
P.S. The workaround for the import is to use the default ~/go as GOPATH. In other words, using /Users/gregory/src/github.com/debedb/kabuta worked fine.

A few gists

Just a few gists to park here for later reference.

Signing an OAuth 1.0 request

To work with the Twitter Ads API, one needs OAuth 1.0. There’s a nice little snippet of Java here, but there’s an issue with it. After chasing some red herrings due to Postman collections, the problem turned out to be that the query string is not properly encoded. Code fixed to do that for query parameters (while still skipping params from the request body, because I don’t need them now), with nonce generation added, is at this gist.

Maven dependencies diff

Having run into problems traceable to dependency differences, here is a pom-compare.py script that compares two pom.xml files, giving the difference in dependencies. Given two files — the current one, which may have a problem, and one from a known-good project — this script will show which dependencies in the problematic file may be older than needed, or are missing entirely.

Generic code for Google Ads API query

Using Google Ads API involves a lot of code that follows certain patterns (mutates, operations, builders, etc). As a fan of all things meta, including reflection, I just had to make a generic example of doing that. So now a parameterized invocation can be used in place of, for example, both campaign or ad group creation code. A similar approach can be done for update, of course.

Mocking DB calls

Using Mockito, it is quite easy to mock up a set of DB calls. For example, here’s a mock ResultSet backed by a Map.

Jar diff

Who hasn’t needed to diff JARs? Thanks to procyon, this is easy.

FinOps

Some time ago a discussion about CIO vs CMO as it comes to ad tech started, and as I see it, it still continues. As a technical professional in ad tech space, I followed it with interest.

As I was building ad tech in the cloud (which usually involves large scale — think many millions of QPS), the business naturally became quite cost-conscious. It was then that, meditating on the above CIO-CMO dichotomy, I thought that perhaps the next thing is the CIO (or the CTO) vs — or together with — the CFO.

What if the decision whether to commit cloud resources (and what kind of resources) to a given business problem is dictated not purely by technology but by financial analysis? E.g., a report is worth it if we can produce it using mostly spot instances; if it goes beyond a certain cost, it is not worth it. Etc.

These are all very abstract and vague thoughts, but why not?

Recently I learned of an effort that seems to more or less agree with that thought — the FinOps foundation, so I am checking it out currently.

Sounds interesting and promising so far.

And nice badge too.


Ad-hoc querying on AWS: Connecting BI tools to Athena

In a previous post, we discussed using Lambda, Glue and Athena to set up queries of events that are logged by our real-time bidding system. Here, we will build on that foundation, and show how to make this even friendlier to business users by connecting BI tools to this setup.

Luckily, Athena supports both JDBC and ODBC, and, thus, any BI tool that uses either of these connection methods can use Athena!

First, we need to create an IAM user. The minimum policies required are:

  • AmazonAthenaFullAccess
  • Writing to a bucket for Athena query output (use an existing one or create a new one). For the sake of example, let’s call it s3://athena.out
  • Reading from our s3://logbucket, which is where the logs are

Now we’ll need the access and secret keys for that user, to use with the various tools.

JDBC

The JDBC driver (com.simba.athena.jdbc.Driver) can be downloaded here.

The JDBC URL is constructed as follows:

jdbc:awsathena://AwsRegion=<aws-region>;User=<aws-access-key>;Password=<aws-secret-key>;S3OutputLocation=s3://athena.out;

Here’s a sample Java program that shows it in action:

import java.sql.Connection;
import java.sql.DriverManager;
 
public class Main {
  public static void main(String[] args) throws Throwable {
    Class.forName("com.simba.athena.jdbc.Driver");
    String accessKey = "...";
    String secretKey = "...";
    String bucket = "athena.out";
    String url = "jdbc:awsathena://AwsRegion=us-east-1;User=" + accessKey + ";Password=" + secretKey
        + ";S3OutputLocation=s3://" + bucket +”;";
    Connection connection = DriverManager.getConnection(url);
    System.out.println("Successfully connected to\n\t" + url);
  }
}

Example using JDBC: DbVisualizer

  1. If you haven’t already, download the JDBC driver to some folder.
  2. Open the Driver Manager (Tools-Driver Manager).
  3. Press the green + to create a new driver.
  4. Press the folder icon on the right …
  5. … and browse to the folder where you saved the JDBC driver and select it.
  6. Leave the URL Format field blank and pick com.simba.athena.jdbc.Driver for Driver Class.
  7. Close the Driver Manager, and let’s create a Connection.
  8. We’ll use the “No Wizard” option. Pick Athena from the dropdown in the Driver (JDBC) field and enter the JDBC URL from above in the Database URL field.
  9. Press “Connect” and observe DbVisualizer read the metadata information from Athena (well, Glue, really), including tables and views.

ODBC (on OSX)

  1. Download and run the ODBC driver installer
  2. Create or edit /Library/ODBC/odbcinst.ini to add the following information:
    [ODBC Drivers]
    Simba Athena ODBC Driver=Installed
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib

    If the odbcinst.ini file already has entries, put new entries into the appropriate sections; e.g., if it was
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so

    Then it becomes
    [ODBC Drivers]
    PostgreSQL Unicode = Installed
    Simba Athena ODBC Driver=Installed
    [PostgreSQL Unicode]
    Description = PostgreSQL ODBC driver
    Driver = /usr/local/lib/psqlodbcw.so
    [Simba Athena ODBC Driver]
    Driver = /Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
  3. Create or edit, in a similar fashion, /Library/ODBC/odbc.ini to include the following information:
    [AthenaDSN]
    Driver=/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib
    AwsRegion=us-east-1
    S3OutputLocation=s3://athena.out
    AuthenticationType=IAM Credentials
    UID=AWS_ACCESS_KEY
    PWD=AWS_SECRET_KEY
  4. If you wish to test, download and run ODBC Manager. You should see that it successfully recognizes the DSN.

Example using ODBC: Excel

  1. Switch to the Data tab, and under New Database Query select From Database.
  2. In the iODBC Data Source Chooser window, select the AthenaDSN we configured above and hit OK.
  3. Annoyingly, despite having configured it there, you will be asked for credentials again. Enter the access and secret keys.
  4. You should see a Microsoft Query window.

Success!

Helpful links

Ad-hoc querying on AWS: Lambda, Glue, Athena

Introduction

If you give different engineers the same problem, they will usually produce reasonably similar solutions (mutatis mutandis). For example, when I first came across a reference implementation of an RTB platform using AWS, I was amused by how close it was to what we had implemented in one of my previous projects (OpenRTB).

So it was not much of a surprise that the next RTB system used a similar pattern: logs are written to files, pushed to S3, then aggregated in Hadoop, from where the reports are run.

But there were a few problems in the way… 

Log partitioning

The current log partitioning in S3 is by server ID. This is really useful for debugging, and is fine for some aggregations, but not great for various reasons — it is hard to narrow things down by date, resulting in large scans, and it is, therefore, even harder to do joins. Large scans in tools like Athena also translate into larger bills. In short, Hive-style partitioning would be good.

To that end, I’ve created a Lambda function, repartition, which is triggered when a new log file is uploaded to s3://logbucket/ bucket:

import boto3
from gzip import GzipFile
from io import BytesIO
import json
import urllib.parse

s3 = boto3.client('s3')

SUFFIX = '.txt.gz'

V = "v8"

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        s3obj = s3.get_object(Bucket=bucket, Key=key)
        src = { 'Bucket': bucket, 'Key': key }
        (node, orig_name) = key.split("/")
        (_, _, node_id) = node.split("_")
        name = orig_name.replace(SUFFIX, "")
        (evt, dt0, hhmmss) = name.split("-")
        hr = hhmmss[0:2]
        # date-hour
        dthr = f"year=20{dt0[0:2]}/month={dt0[2:4]}/day={dt0[4:6]}/hour={hr}"
        
        schema = f"{V}/{evt}/{dthr}"
        dest = f"{schema}/{name}-{node_id}{SUFFIX}"
        print(f"Copying {key} to {dest}")
        s3.copy_object(Bucket=bucket, Key=dest, CopySource=src)

        return "OK"
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

if __name__ == "__main__":
    # Wrapper to run from CLI for now
    s3entry = {'bucket' : {'name' : 'logbucket'},
               'object' : {'key'  : 'server/requests-200420-003740.txt.gz'}}
    event = {'Records' : [{'s3' : s3entry}] }
    lambda_handler(event, None)
    

At that time, the log is copied to a new path created under the v8 prefix, with the following pattern:

<event>/year=<year>/month=<month>/day=<day>/hour=<hour>/<filename>-<nodeID>. For example, 

s3://logbucket/server1234/wins-200421-000003.txt.gz

is copied to 

s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz

(The “v8” prefix is there because I have arrived at this schema having tried several versions — 7, to be exact).

What about storage cost?

  • An additional benefit of using date-based partitioning is that we can easily automate changing storage type to Glacier for folders older than some specified time, which will save S3 storage costs of the duplicate data.
  • In the cloud, the storage costs are not something to worry about; the outgoing traffic and compute is where the problem is. So move the compute to the data, not the other way around.

NOTE: Partitions do not equal timestamps

Partitioning is based on the file name. Records inside the file may have timestamps whose hour is one greater or one less than the partition hour, for obvious reasons. Thus, partitions are there to reduce the number of scanned records, but care should be taken when querying not to assume that the timestamps under year=2020/month=04/day=21/hour=00 are all in the 0th hour of 2020-04-21.

Discover metadata in Glue

Glue is an ETL service on AWS. One of the great features of Glue is crawlers that attempt to glean metadata from the logs. This is really convenient because it saves us the tedious step of defining the metadata manually.

So we set up a crawler. For an explanation of how to do it, see the “Tutorial” link on the left-hand side of this page.

However, it takes some time to get the configuration correct. 

  1. We would want to exclude some logs because we know their format is not good for discovery (they are not straight-up JSON or CSV, etc; and, at the moment, custom SerDes are not supported by Athena) — but see below for exceptions. This is done in the “Data store” part of the crawler configuration.
  2. We want Glue to treat new files of the same type as different partitions of the same table, not create new tables. For example, given the partitioning convention we created above, these two paths:
  • s3://logbucket/v8/wins/year=2020/month=04/day=21/hour=00/wins-200421-000003-1234.txt.gz
  • s3://logbucket/v8/wins/year=2020/month=04/day=22/hour=11/wins-200422-000004-4321.txt.gz

should be treated as partitions of table “wins”, not two different tables. We do this in the “Output” section of the crawler configuration.

Once the crawler runs, we will see a list of tables created in Glue.

If we see tables here that look like parts of the partitioned path (e.g., year=2020, or ending with _txt_gz), it means the crawler got confused when discovering their structure. We add those to the crawler’s exclusion list and create their metadata manually. Fortunately, there are not that many such logs, and we can exclude them one by one.

Of course, while the crawler can recognize the file structure, it doesn’t know what to name the fields. So we can go and name them manually. While this is a tedious process, we don’t have to do it all at once — just on an as-needed basis.

We will want to keep the crawler running hourly so that new partitions (which get created every hour) are picked up. (This can also be done manually from Athena — or Hive — by issuing the MSCK REPAIR TABLE command.)

First useful Athena query

Looking now at Athena, we see that the metadata from Glue is visible, and we can run queries.

Woohoo! It works! I can do nice ad hoc queries. We’re done, right?

Almost. Unfortunately, for historical reasons, our logs are not always formatted to work ideally with this setup.

We could identify two key cases:

  1. Mostly CSV files:
    • There are event prefixes preceding the event ID, even though the event itself is already defined by the log name. For example, bid:<BID_ID>, e.g.:
      bid:0000-1111-2222-3333
    • A CSV field that really contains two values. E.g., a log that is comma-separated into two fields, timestamp and “message”, where the message includes a “Win: ” prefix before the bid ID, and then — with no comma! — “price:” followed by the price. Like so:
      04/21/2020 00:59:59.722,Win: a750a866-8b1c-49c9-8a30-34071167374e_200421-00__302 price:0.93
      However, what we want to join on is the ID. In these cases, we can use Athena views — for the two respective cases above:
       CREATE OR REPLACE VIEW bids2 AS
      SELECT "year", "month", "day", "hour",
      "SUBSTRING"("bid_colon_bid_id", ("length"('bid:') + 1)) "bid_id"
      FROM bids

      and
       CREATE OR REPLACE VIEW bids2 AS
      SELECT "year", "month", "day", "hour",
      "SPLIT_PART"("message", ' ', 3) "bid_id"
      FROM wins

      Now joins can be done on bid_id column, which makes for a more readable query.
  2. The other case is a log that has the following format: a timestamp, followed by a comma, followed by JSON (an OpenRTB request wrapped in one more layer of our own JSON that augments it with other data) — which makes it neither CSV nor JSON. The Glue crawler gets confused. The solution is using a RegEx SerDe.

    And then we can use Athena’s JSON functions to deal with the JSON column, for example, to see distribution of requests by country:
     SELECT JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country') country,
    COUNT(*) cnt
    FROM requests
    GROUP BY
    JSON_EXTRACT_SCALAR(request, '$.bidrequest.device.geo.country')
    ORDER BY cnt DESC
    Success! We can now use SQL to easily query our event logs.

Helpful links

I wish I didn’t have to: Decompiling Java

“I wish I didn’t have to” sounds like a category of its own, sort of a complement to The Daily WTF. But, alas, I do — I need to understand the differences between two JARs, and I haven’t looked into Java decompilers since JAD (which goes to show… something). But after some brief googling — Procyon rules! So here’s a script to do JAR diffs.

P.S. The humility of the author in designating a pretty good product as a 0 version (apropos — this is silly) led me to dig further into this project, and it is cool stuff. Looking forward to digging some more.

A look at app-ads.txt

Introduction

App-ads.txt is a follow-up to IAB’s ads.txt initiative aimed at increasing transparency in the programmatic ad marketplace. Related to it are a number of other initiatives and standards such as supply chain, payment chain, sellers.json, various things coming out of TAG group, etc. See last section for relevant links.

TL;DR: This standard allows an ad inventory buyer (DSP) to verify whether the entity selling the inventory (SSP) is authorized by the publisher to sell it. This is done by looking up the publisher’s URL for an app (bundle) in the app store, and examining the app-ads.txt file at that URL. The file lists authorized sellers of the publisher’s inventory, along with the ID by which this publisher should be identified in the reseller’s system (publisher.id in OpenRTB).

Terminology

Here, the words “publisher” and “developer” may be used interchangeably; ditto for “app” and “bundle”.

First pass implementation

The algorithm is fairly simple: for each app — aka bundle — of interest, grab the app publisher’s domain from the appropriate app store, fetch the app-ads.txt file from that domain, and parse it. But of course, in theory there is no difference between theory and practice, but in practice there is. In reality, there are some deviations from the standard and exceptional cases that had to be taken care of in the process.
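The actual script is Python (appadsparser.py, see below), but the gist of the per-line check can be sketched like so — assuming only the standard’s basic comma-separated line format, and skipping variable (e.g. CONTACT=) lines:

package main

import (
	"fmt"
	"strings"
)

// Record is one app-ads.txt authorization line, per the IAB format:
// <ad system domain>, <publisher account ID>, <DIRECT|RESELLER>[, <cert authority>]
type Record struct {
	Domain       string
	PublisherID  string
	Relationship string
}

// parseAppAdsTxt is a simplified sketch; the real parser must also deal
// with the assorted real-world deviations mentioned above.
func parseAppAdsTxt(body string) []Record {
	var records []Record
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(strings.SplitN(line, "#", 2)[0]) // strip comments
		if line == "" || strings.Contains(line, "=") {
			continue // skip blanks and variable lines
		}
		f := strings.Split(line, ",")
		if len(f) < 3 {
			continue
		}
		records = append(records, Record{
			Domain:       strings.ToLower(strings.TrimSpace(f[0])),
			PublisherID:  strings.TrimSpace(f[1]),
			Relationship: strings.ToUpper(strings.TrimSpace(f[2])),
		})
	}
	return records
}

func main() {
	body := "# sample\nexamplessp.com, pub-1234, DIRECT\ngoogle.com, pub-5678, RESELLER\n"
	for _, r := range parseAppAdsTxt(body) {
		fmt.Println(r.Domain, r.PublisherID, r.Relationship)
	}
}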

As a first pass, we are running this process semi-manually; if the results warrant full automation, this can easily be accomplished. Here is what was done (more technical details can be found on GitHub):

  1. Bundle IDs and other information were retrieved from the request log for the last few days (really, requests for all bids in May, as of end of May 6), using a query to Athena. This comes to a total of:
    1. 1,144,377 bids
    2. 613,241 unique bundle IDs.
    3. 685,221 bundle-SSP combinations
  2. The result set, exported as a CSV file, is loaded into a SQLite DB. The same SQLite DB is also used for caching results.
  3. Go through the list of bundles with a Python script, appadsparser.py. While the ads.txt standard provides the specification of how app stores should provide publisher information, only the Google Play Store currently follows it. For Apple’s App Store, we crawl its search API’s lookup service and, if not found there, the App Store’s page directly (though this appears to be a violation of robots.txt).
  4. The result is a semi-structured log. The summary is below.

Results

Result codes

The following are result codes and their meaning:

  • OK – The SSP we got the traffic on is authorized by the bundle publisher’s app-ads.txt for the publisher ID specified in request
  • OK_GOOGLE – Google is authorized by the bundle publisher’s app-ads.txt. See below on explanation of why Google is special.
  • PROBLEMS — Mismatch found between authorized SSPs and/or publisher IDs in app-ads.txt.
  • NO_APPADS_TXT – Publisher’s website has no app-ads.txt file
  • NOT_FOUND_IN_PLAY_STORE – bundle not found in Google Play Store. NOT_FOUND_IN_ITUNES – same for Apple App Store.
  • NO_URL_IN_PLAY_STORE – cannot find publisher’s URL in Google Play Store. NO_URL_IN_ITUNES – same for Apple App Store.
  • FACEBOOK_URL – publisher’s URL points to a Facebook page (this happens often enough that it warrants its own status)
  • BAD_DEV_URL – publisher’s URL in an app store is invalid
  • BAD_BUNDLE_ID – store URL (from the OpenRTB request) is invalid, and cannot be determined from the bundle ID either

Note on Google

NOTE: At the moment, there is no way to check the publisher ID for Google due to an internal issue. In other words, we cannot verify the following part of the spec:

    Field #2 - Publisher’s Account ID - This must contain the same value used in transactions (i.e. OpenRTB bid requests) in the field specified by the SSP/exchange. Typically, in OpenRTB, this is publisher.id.

Given Google’s aggressive anti-fraud enforcement, we can for now stipulate that it would not run unauthorized inventory. There is still the possibility of fraud, of course. But in the table below we distinguish between bundles served via Google (where we do not check for the publisher ID, just the presence of Google in app-ads.txt) and those served via other SSPs, where we cross-reference the publisher ID.

As a corollary of the above, you will not see Google inventory under the “PROBLEMS” status.

This gives a good sample:

    Result code                Count
    OK                         5,693
    OK_GOOGLE                 29,330
    PROBLEMS                     361
    NO_APPADS_TXT              7,653
    NOT_FOUND_IN_PLAY_STORE    4,417
    NOT_FOUND_IN_ITUNES          134
    NO_URL_IN_PLAY_STORE      20,112
    NO_URL_IN_ITUNES          16,168
    FACEBOOK_URL               1,378
    BAD_DEV_URL                    4

Summary of findings

It appears that due to the not very high adoption of this standard at present (developer URLs not present in app stores, or the app-ads.txt file not present on the developer domain), there is not much utility to it at the moment. However, as the adoption rate is increasing (see below), this is worth revisiting. Also, consider that for this exercise we only used bid information — that is, not a sample of full traffic, but just what we have bid on. This may not be representative of the entire traffic, and may be interesting to explore.

Consider also that the developer may well have the app-ads.txt file on the website, but if the website is not properly listed in the app store, we have no way of getting to it (yet SSPs may include those sites in their overall numbers; see, e.g., MoPub below).

What does the industry say?

  • Google Play Store is reported to have only ~8% adoption. Worth quoting here is this section of the report:

      Who Are the Top ‘Direct’ Ad Partners Inside the App-Ads.txt in Google Play Apps?
      Direct ad partners are those which have been granted direct permission by app developers to sell app ad space. That is, they are explicitly listed on the app’s “App-ads.txt” file. Google.com is listed as a direct ad partner on 95.87% of all “App-Ads.txt” files for Google Play apps. This makes it the most frequently mentioned direct ad partner for apps available on the Google Play.

  • Pixalate’s 2019 app-ads.txt trends report is interesting:
    • It doesn’t seem that app-ads.txt makes that much difference for IVT (invalid traffic) — apps with app-ads.txt have 18.7% IVT vs 21.1% for those without (pg. 6), despite Pixalate dramatizing this 2.4-percentage-point difference as a 13% increase (2.4/18.7 — lies, damned lies and statistics).
    • It lists much higher adoption numbers for Google Play Store apps than the above, but that is across the top 1K apps.
    • Increasing rates of adoption (~65% rise in Q4 2019 — pg. 13).
    • Unity, Ironsource, MoPub, Applovin, Chartboost — in that order — are the top direct ad partners for Android apps (pg. 18); MoPub, Unity, IronSource are in the top direct AND reseller partners for iOS (pg. 20).
  • MoPub claims that “app-ads.txt file adoption exceeds 80% for managed MoPub publishers”. It’s unclear what the qualifier “managed” means. Sampling our data, we see issues with app-ads.txt on MoPub of about 48% in total, breaking down as follows:
    • No app-ads.txt: 11%
    • No developer URL found in store: 22.5%
    • Not found in app store: 13.7%

Not found?

Additionally, it is worth looking at the bundles that are flagged as NOT_FOUND_IN_PLAY_STORE or NOT_FOUND_IN_ITUNES — how come those cannot be found?

This does not have to be something nefarious; for example, based on spot-checking, it can be due to:

  • Case-sensitivity. Play Store bundles are case-sensitive, but an SSP may normalize them to lower case when sending (e.g., com.GMA.Ball.Sort.Puzzle becomes com.gma.ball.sort.puzzle)
  • Some mishap like non-existent com.rlayr.girly_m_art_backgrounds being sent (but com.instaforall.girly_m_wallpapers exists)

But even if the majority of the NOT_FOUND errors are due to such discrepancies, not fraud, it means that the app-ads.txt mechanism itself is currently de facto not very reliable.

Some other results

  • Q: Are there any apps that present as different publishers on the same SSP?
  • A: Not too many (about 8.6%), and even so, for the most important apps it is at most 2-3 publisher IDs — even at the long tail the max is 10. This, though, can still account for some of the app-ads.txt PROBLEMS seen above.

Related documents, standards and initiatives

Swagger as you Go

We at the Romana project offer a REST API, and we wanted to provide documentation for it via Swagger. But writing out JSON files by hand seemed not just tedious (and thus error-prone), but also likely to result in outdated documentation down the road.

Why not automate this process, sort of like godoc? Here I walk you through an initial attempt at doing that — a Go program that takes as input Go-based REST service code and outputs Swagger YAML files.

It is not yet available as a separate project, so I will go over the code in the main Romana code base.

NOTE: This approach assumes the services are based on the Romana-specific REST layer on top of Negroni and Gorilla MUX. But a similar approach can be taken without Romana-specific code — or Romana approach can be adopted.

The entry point of the doc tool for generating the Swagger docs is in doc.go.

The first step is to run Analyzer on the entire repository. The Analyzer:

  1. Walks through all directories and tries to import each one (using Import()). If the import is unsuccessful, the directory is skipped; if successful, it is a package.
  2. For each *.go and *.cgo file in the package, parse the file into its AST
  3. Using the above, compute docs and collect godocs for all methods (a rough sketch of this step follows below)
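For the curious, that collection step looks roughly like this with go/parser and go/doc (a sketch, not the actual Analyzer code — the real one also handles methods and *.cgo files):

package main

import (
	"fmt"
	"go/doc"
	"go/parser"
	"go/token"
)

// collectGodocs parses one package directory and maps top-level
// function names to their godoc text.
func collectGodocs(dir string) (map[string]string, error) {
	fset := token.NewFileSet()
	pkgs, err := parser.ParseDir(fset, dir, nil, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	docs := make(map[string]string)
	for _, pkg := range pkgs {
		p := doc.New(pkg, dir, 0)
		for _, f := range p.Funcs {
			docs[f.Name] = f.Doc // later matched up with route handlers
		}
	}
	return docs, nil
}

func main() {
	docs, err := collectGodocs(".")
	if err != nil {
		panic(err)
	}
	for name, text := range docs {
		fmt.Printf("%s: %s\n", name, text)
	}
}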

The next step is to run the Swaggerer — the Swagger YAML generator.

At the moment we have 6 REST services to generate documentation for. For now, I’ll explicitly name them in the main code. This is the only hardcoded/non-introspectable part here.

From here, for each service, we initialize the Swaggerer and call its Process() method, which will, in turn, call the main workhorse, the getPaths() method, which will, for each Route:

  1. Get its Method and Pattern — e.g., “POST /addresses”
  2. Get the Godoc string of the Handler (from the Godocs we collected in the previous step)
  3. Figure out documentation for the body, if any. For that, we use reflection on the MakeMessage field — the mechanism we used for sort-of-strong-dynamic typing when consuming data, as seen in a prior installment on this topic.

This is pretty much it.

As an example, here is Swagger YAML for the Policy service.

See also

Building on Gorilla Mux and Negroni

The combination of Negroni and Gorilla MUX is useful for building REST applications. However, there are some features I felt needed to be built on top. This has not been made into a separate project, and in fact I doubt it needs to be — the world doesn’t need more “frameworks” to add to the paralysis of choice. I think it would be better, instead, to go over some of the features that may be of use, so I’ll do that here and in further blog entries.

This was borne out of some real cases at Romana; here I’ll show examples of some of these features in a real project.

Overview

At the core of it all is a Service interface, representing a REST service. It has an Initialize() method that would initialize a Negroni, add some middleware (we will see below) and set up Gorilla Mux Router using its Routes. Overall this part is a very thin layer on top of Negroni and Gorilla and can be easily seen from the above-linked source files. But there are some nice little features worth explaining in detail below.

In the below, we assume that the notions of Gorilla’s Routes and Handlers are understood.

Sort-of-strong-dynamic typing when consuming data


While we have to define our route handlers as taking interface{} as input, it would nonetheless be nice if a handler received the struct it expects, so it can cast it to the proper type and proceed, instead of each handler parsing the provided JSON payload.

To that end, we introduce a MakeMessage field in Route. As its godoc says, “This should return a POINTER to an instance which this route expects as an input”, but let’s illustrate what it means if it is confusing.

Let us consider a route handling an IP allocation request. The input it needs is an IPAMAddressRequest, and so we set its MakeMessage field to a function returning that, as in


func() interface{} {
    return &api.IPAMAddressRequest{}
}

Looks convoluted, right? But here is what happens:

The handlers of routes are not called directly; they are wrapped by the wrapHandler() method, which will:

  1. Call the above func() to create the pointer to IPAMAddressRequest
  2. Use the unmarshaller (based on Content-Type) to unmarshal the body of the request into the above struct pointer.

Voila!
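A minimal sketch of this wrapping idea (not the actual Romana code — content-type negotiation and the rest of the middleware are omitted):

package rest

import (
	"encoding/json"
	"net/http"
)

// Route pairs a handler with a factory for the input type it expects.
type Route struct {
	MakeMessage func() interface{} // returns a POINTER, e.g. &api.IPAMAddressRequest{}
	Handler     func(input interface{}) (interface{}, error)
}

// wrapHandler decodes the request body into the expected struct before
// calling the handler, so the handler can safely type-assert its input.
func wrapHandler(route Route) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var input interface{}
		if route.MakeMessage != nil {
			input = route.MakeMessage()
			if err := json.NewDecoder(r.Body).Decode(input); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
		}
		out, err := route.Handler(input)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		_ = json.NewEncoder(w).Encode(out)
	}
}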

Rapid prototyping with hooks

Sometimes we prototype features outside of the Go-based server — as we may be calling out to various CLI utilities (iptables, kubectl, etc), it is easier to first ensure the calls work as CLI or shell scripts, and iterate there. But for demonstration/QA purposes we would still like to have this functionality available via a REST call to the main Romana service. Enter the hooks functionality.

A Route can specify a Hook field, which is a structure that defines:

  • Which executable to run
  • Whether to run it before or after the Route‘s Handler (When field)
  • Optionally, a field where the hook’s output will be written (which can then be examined by the Handler if the When field is “before”). If not specified, the output will just be logged.

Hooks are specified in the Romana config, and are set up during service initialization (this is in keeping with the idea that no Go code needs to be modified for this prototyping exercise).

Then, during a request, they are executed by the wrapHandler() method (described above).

That’s it! It allows for doing some work outside the server to get it right, and only then bothering about adding the functionality into the server’s code.

If this doesn’t seem like that much, wait for further installments. There are several more useful features to come. This just sets the stage.

IDRing

A major feature of the Romana Project is topology-aware IPAM, and an integral part of it is the ability to assign consecutive IP addresses in a block (and reuse freed up addresses, starting with the minimal).

Since IPv4 addresses are essentially 32-bit uints, the problem is basically that of maintaining a sequence of uints while allowing reuse.

To that end, a data structure called IDRing was developed. I’ll describe it here. It is not yet factored out into a separate project, but as Romana is under Apache 2.0 license, it can still be reused.

  • The IDRing structure is constructed with the NewIDRing() function, which takes the lower and upper bounds (inclusive) of the interval from which to give out IDs. For example, specifying 5 and 10 will allow one to generate IDs 5,6,7,8,9,10 and then return errors because the interval is exhausted.

    Optionally, a locker can be provided to ensure that all operations on the structure are synchronized. If nil is specified, then synchronization is the responsibility of the user.

  • To get a new ID, call GetID() method. The new ID is guaranteed to be the smallest available.
  • When an ID is no longer needed (in our use case — when an IP is deallocated), call ReclaimID().
  • A useful method is Invert(): it returns an IDRing whose available IDs are the ones that are allocated in the current one, and whose allocated ones are the available ones in the current one. In other words, the inverse of an IDRing with min 1, max 10 and available IDs from 4 to 8 inclusive is an IDRing with the following available ranges: [1,3], [9,10].
  • You can see examples of the usage in the test code and in the actual IP allocation logic; a toy sketch follows after this list.
  • Persisting it is as easy as using the locker correctly and just encoding the structure to/decoding it from JSON.
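Here is what usage looks like in that toy sketch — the import path and exact signatures are my assumptions from the description above, so check the source for the real ones:

package main

import (
	"fmt"
	"sync"

	"github.com/romana/core/common/idring" // hypothetical import path
)

func main() {
	// Assumed signature: NewIDRing(min, max uint64, locker sync.Locker) *IDRing.
	ring := idring.NewIDRing(5, 10, &sync.Mutex{})

	a, _ := ring.GetID()  // 5 — always the smallest available
	b, _ := ring.GetID()  // 6
	_ = ring.ReclaimID(a) // 5 becomes available again

	c, _ := ring.GetID() // 5 — reused before 7 is ever handed out
	fmt.Println(a, b, c) // 5 6 5
}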

Kabuta: GoClipse meets Delve

I am addicted to having good development tools; specifically, I like my debuggers visual. Once, I was so annoyed by the inability to debug a common use case — calling a PL/SQL stored procedure from Java — that I made it into a single-stack cross-language debugger project, for example.

And I’m also a fan of Eclipse (well, see above).

So when Romana went with Go (pun definitely intended) as our language of choice, I quickly found GoClipse, and it was nice.

But then I learned about Delve and I wanted to have the best of both worlds — the functionality of Delve and the graphics of GoClipse.

So here’s an idea… Eclipse (from which GoClipse is, obviously, derived) understands GDB/MI, and can represent it visually. So all that remains is to adapt Delve’s API to GDB/MI.

And so I started it as a Kabuta project. Let’s hope I don’t abandon it…

Content assist

It looks like you are researching razors. I think you are about to go off on a yak-shaving endeavor, and I cannot let you do that, Dave.

What I would really like my DWIM agent to do. That, and to stop calling me Dave.

Being lazy and impatient, I like the idea of an IDE. The ease of things like autocompletion, refactoring, code search, and graphical debugging with evaluation is, for lack of a better word, good.

I like Eclipse in particular — force of habit/finger memory; after all, neurons that pray together stay together. Just like all happy families are alike, all emacs users remember the key sequence to GTFO vi (:q!) and all vi users remember the same thing for emacs (C-x C-c n) – so they can get into their favorite editor and not have to “remember”.

So, recently I thought that it would be good for a particular DSL I am using to have an auto-completion feature (because why should I have to remember?). So I thought, great, I’ll maybe write an Eclipse plugin for that… Because, hey, I’ve made one before, how bad could it be?

Well, obviously I would only be solving the problem for Eclipse users of the DSL in question. And I have a suspicion I am pretty much the only one in that group. Moreover, even I would like to use some other text editor occasionally, and get the same benefit.

It seems obvious that there should be a separation of concerns, so to speak:

  • Provider-side: A language/platform may expose a service for context-based auto-completion, and
  • Consumer-side: An editor or shell may have a plugin system exposed to take advantage of this.

Then a little gluing is all that is required. (OK, I don’t like the “provider/consumer” terminology, but I cannot come up with anything better — I almost named them “supply-side” and “demand-side” but it evokes too much association with AdTech that it’s even worse).

And indeed, there are already examples of this.

There is a focus on an IDE paradigm of using external programs for building, code completion, and any others sorts of language semantic functionality. Most of MelnormeEclipse infrastructure is UI infrastructure, the core of a concrete IDE’s engine functionality is usually driven by language-specific external programs. (This is not a requirement though — using internal tools is easily supported as well).

  • Atom defines its own API

And so I thought – wouldn’t it be good to standardize on some sort of interaction between the two in a more generic way?

And just as I thought this, I learned that the effort already exists: Language-server protocol by Microsoft.

I actually like it when an idea is validated and someone else is doing the hard work of making an OSS project out of it…

Rethinking data gravity

At some point, I remember having a short chat with Werner Vogels about taking spot instances to the extreme, in a genuine market in which compute power can be traded. His response was “what about data gravity?” — to which my counter was: by making data transfer into S3 free (and, later, making true the adage about not underestimating the bandwidth of a truck full of tape), you, while understanding the gravity idea, also provide incentives to not make it an issue. As in — why don’t I make things redundant? Why don’t I just push data to multiple S3 regions and have my compute follow the sun in terms of cost? Sure, it doesn’t work at huge scale, but it just may work perfectly fine at some medium scale, and this is what we used for implementing our DMP at OpenDSP.

Later on, I sort of dabbled in something in the arbitrage of cost space. I still think compute cost arbitrage will be a thing; 6fusion did some interesting work there; ClusterK got acquired by Amazon for their ability to save cost even when running heavy data-gravity workload such as EMR, and ultimately isn’t compute arbitrage just an arbitrage of electricity? But I digress. Or do I? Oh yes.

In a way, this is not really anything new — it is just another way to surface the same idea as Hadoop.

OpenDSP’s DMP: Nanoput

Here I will describe the Nanoput project, which comprises a large part of OpenDSP’s DMP (Data Management Platform). There are, of course, other pieces — the entire picture will be painted under the DMP tag.

Overall DMP stack is as follows:

  • Nanoput proper: NGINX with Lua to handle the events. I’ll confess that at the moment this is a bit of a pet (as in, not cattle). We are considering using OpenResty instead of rolling our own setup, which uses parts of OpenResty. But no matter; here I will present some features that can be achieved with this setup — and one instance is capable of handling all this.
  • Redis for storing and manipulating user sets — ZSET is great
  • MySQL for storing metadata — will be described in a separate post
  • PHP/JS for a simple Web interface to define the said metadata
  • Python for translating metadata into the configuration for NGINX
  • AWS S3 for storing raw logs — pre-partitioned so that EMR can be used easily.

Conceptual introduction

Conceptually, let’s consider the idea of an “event”. An impression, a conversion, a video-tracking event, a site visit, etc, is an event — anything that fires a request to our Nanoput is one. You may recognize a similarity with Snowplow — and that is because we are solving a similar problem.

To proceed further, I must ask a little indulgence — please bear with me. As a technologist, it really grated on me to hear words like “Javascript pixel”, but I have learned to stop worrying and love the bomb. Therefore, a Javascript pixel it is, when it is JS code that fires some GET URLs. Now, then, the requests are assumed to be fired as GET HTTP requests (they do not have to be, but that’s the primary idea behind Nanoput per se — management of metadata to ingest things via other means, like FTP upload, etc, will be addressed separately) and can originate, for example, from:

  • Exchanges or DMPs, as an exchange-initiated cookie-sync: see below.
  • Regular user behavior — impressions, in case of video, video-tracking events; conversions
  • Calls as if initiated by exchanges or DMPs to Nanoput but in reality are, heh, Javascript pixels

Now, let us also consider the idea of a “user segment”. If you think about it, a segment is just a set of users. Thus, we may as well consider a user that produced a certain event as belonging to some segment. These may be explicitly asked for, such as “users we want to retarget“, or “users who converted”, etc. But there is no reason why any event cannot be defined as a segment-defining one.

Segments, here, are a special case of data collections concept discussed in a different post.

Given that, we can now dive into the Nanoput implementation.

General data acquisition idea

Here, we simply leverage basic NGINX functionality: logging. To that end, we split the main config file into included sections that deal with log format, location and behavior.

Static data acquisition URLs

By "static" here we mean common use cases that are just part of Nanoput (hence the "man" subdirectory you will notice in the examples — it stands, just like the Unix man command, for "manual"). Here we have:

  • Site events (essentially, those are an extension of retargeting concept).
  • Standard event tracking — by which we mean, standard events that happen in Ad world.

Notice that we also augment the information available from NGINX (HTTP headers, etc.) with geo data, using the GeoIP module, and with user-agent/device/OS information.

Dynamic (metadata-driven) data acquisition URLs

Dynamic data acquisition works simply: a process reads the metadata table and creates the appropriate entries in the NGINX configs that define log format, location and behavior.
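
To make that concrete, here is a minimal sketch of such a generator in Python. The metadata shape (a name plus a list of query-string fields), the file paths and the templates are made up for illustration; the real schema is described in the metadata post:

# Hypothetical metadata row: ("imp", ["campaign", "price"])
FORMAT_TPL = "log_format {name} '$remote_addr|$time_iso8601|{fields}';\n"
LOCATION_TPL = (
    "location /{name} {{\n"
    "    access_log /var/log/nanoput/{name}.log {name};\n"
    "    empty_gif;\n"
    "}}\n"
)

def generate(rows, formats_path, locations_path):
    # log_format lines get included at the http level; location blocks
    # inside the server block, per the config-splitting described above
    with open(formats_path, "w") as ff, open(locations_path, "w") as lf:
        for name, fields in rows:
            args = "|".join("$arg_" + f for f in fields)
            ff.write(FORMAT_TPL.format(name=name, fields=args))
            lf.write(LOCATION_TPL.format(name=name))

# generate([("imp", ["campaign", "price"])],
#          "/etc/nginx/nanoput/formats.conf",
#          "/etc/nginx/nanoput/locations.conf")
# ...then: nginx -s reload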

Creating “segments”

On every "event", a script runs. We use Redis's awesome Sorted Set functionality here, inserting things twice: we create two sorted sets for each key, with the "score" being, respectively, the first and the last time we have seen the user. The key idea here, again, is a variation on dealing with data gravity concerns by just duplicating storage. The reasoning for this is that:

  • First-seen: we can write batch scripts to get rid of users we have last seen over X days ago (expiring targeting).
  • Last-seen: helps us with conversion attribution (yes, this assumes naive last-click attribution or variants thereof).

Duplication is not just for every user — it is for every set. The key here is the set (or segment) name, and the value is the set of users.

An added benefit of this is that new segments can be created by using various Redis set operations (union, intersection) easily.
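
In Python with redis-py, the core of it might look like this (key names are made up for illustration):

import time
import redis  # redis-py client

r = redis.Redis()

def record_event(segment, user_id, now=None):
    # Two sorted sets per segment: score = first-seen and last-seen timestamps
    now = now or time.time()
    r.zadd("seg:%s:first" % segment, {user_id: now}, nx=True)  # keep original first-seen
    r.zadd("seg:%s:last" % segment, {user_id: now})            # always bump last-seen

def expire_segment(segment, max_age_days):
    # Batch job: drop users first seen more than max_age_days ago
    cutoff = time.time() - max_age_days * 86400
    r.zremrangebyscore("seg:%s:first" % segment, "-inf", cutoff)

# New segments via set operations, e.g. the intersection of two segments:
# r.zinterstore("seg:retarget+converted", ["seg:retarget:last", "seg:converted:last"])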

Some useful shortcuts for a DMP

  • Getting OS/browser info without necessarily using WURFL (though that can easily be fronted by NGINX too, actually).

Exchange cookie sync

In the display world, there is a need for cookie syncing between a DSP and a third-party DMP or an exchange/SSP, and that can be exchange-, DMP- or DSP-initiated, or both. Some exchanges may allow the redirect chain to proceed further; some may not. Nanoput provides this functionality for the exchanges we deal with, as well as a template for doing it for other partners — at the speed that NGINX provides. Here are the moving parts:

Storing for further analysis

Raw logs, formatted as above, are uploaded to S3. Notice that they are stored twice, with different partitioning schemas. This is one of the key ideas in Nanoput — storage is cheap; duplicate it, and then use one or the other partitioning schema depending on the use case.
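
For example (the prefixes here are illustrative, not the actual layout): the same log file can land under a date-first prefix, handy for time-window EMR jobs, and under an event-type-first prefix for per-event-type processing:

from datetime import datetime, timezone

def s3_keys(event_type, filename, ts=None):
    # Same log, two partitioning schemas; pick whichever suits the EMR job
    ts = ts or datetime.now(timezone.utc)
    day = ts.strftime("%Y-%m-%d")
    return ("logs/by-date/dt=%s/type=%s/%s" % (day, event_type, filename),
            "logs/by-type/type=%s/dt=%s/%s" % (event_type, day, filename))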

Cookies coming home to Proust

Ok, I couldn't resist the title, and also, Billy Collins' line about "No cookie nibbled by a French novelist could send one into the past more suddenly" is just awesome, so why not… But I digress…

At OpenDSP we, obviously, cookie users (yes, it is a verb — to cookie). While more details can be had in other posts (such as Nanoput, and more under the DMP tag), here I'll just address the format issue.

The cookies we use are known as mod_uid cookies, after the Apache project. (A good writeup is available here.) Because we use NGINX and, consequently, its uid_set/uid_get functionality, one may be puzzled when reconciling the cookie as sent to the user with what NGINX writes in its logs. So here is the reconciliation.
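
A minimal sketch of it, assuming the usual interpretation of the format: the cookie is a base64-encoded 16 bytes, which NGINX logs as four 32-bit hex words with the bytes of each word swapped:

import base64
import struct

def cookie_to_log(cookie_value):
    # base64 cookie as sent to the user -> hex form as seen in the NGINX log
    raw = base64.b64decode(cookie_value)
    return "".join("%08X" % word for word in struct.unpack("<4I", raw))

# usage: cookie_to_log("0gIrUkeZKFYZplQLAwMEAg==")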

Thinking outside the box — and in a sandbox

Give engineers the same problem, and they will all come up with roughly the same solution. But I am proud to have actually invented a different one.

Many systems, such as ad servers, RTB bidders, DSPs, etc., store the business rules (targeting, pacing, etc.) information that a user manipulates via UI in some sort of database (likely RDBMS), and then probably transform it into some sort of fast lookup for the real-time decision making. Seems straightforward.

At OpenDSP we went for a new approach which I believe is quite powerful. Simply put, we compile the rules from DB into a Groovy script (well, a set of Groovy scripts). Given that Groovy compiles to JVM bytecode, essentially we create concrete subclasses conforming to the following interfaces and using the following abstract classes:

  • For ads, that is, targeting, implementing Ad interface and subclassing AdImpl
  • For creatives, implementing Tag interface and subclassing TagImpl
  • Finally, there is the BidPriceCalculator interface, which will be described below. This one is not auto-generated.

A script that compiles the DB rules into the scripts is triggered by the UI whenever anything is saved, and the Lot49 process will reload any changed scripts every so often (e.g., 5 minutes).
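
To make the shape of this concrete (sketched in Python for brevity; the real system generates Groovy, and all names here are illustrative):

class AdImpl:
    # Stand-in for the abstract base: if/then logic over the generated getters
    def can_bid(self, request):
        return request["country"] in self.get_countries()

AD_TEMPLATE = """
class Ad_{ad_id}(AdImpl):
    def get_countries(self):
        return {countries!r}
"""

def compile_ad(row):
    # Turn a targeting row from the DB into source for a concrete subclass
    return AD_TEMPLATE.format(ad_id=row["id"], countries=row["countries"])

source = compile_ad({"id": 42, "countries": ["US", "CA"]})
namespace = {"AdImpl": AdImpl}
exec(source, namespace)  # stand-in for the Groovy compilation step
print(namespace["Ad_42"]().can_bid({"country": "US"}))  # True

The generated source files are exactly what can later be hand-edited before the next regeneration run, which is the point of the whole exercise, as explained below.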

Generally, the algorithm is as follows. During a request, the system will take the OpenRTB bid request and augment it with various data (e.g., geo lookup, information on the user, etc.). It will then, for each Ad, call its canBid1() and checkSegments() methods (both must return true). canBid1() is called first, to quickly filter out those ads that won't bid based on the data already available in the request, before we augment it with data that needs to be fetched from a user cache; after that, checkSegments() will be called on the remaining candidates. In turn, canBid1() will call the canBid() methods of all Tags, to check which creatives, if any, fit this bid request (based on media type, size, video duration if applicable, etc.).

Finally, the price is determined based on pacing and budget settings, unless the appropriate BidPriceCalculator is found for an Ad, in which case it is used to get the bid price.

The resulting scripts are essentially a bunch of getters based on the rules in the DB, and the abstract classes implementing the interfaces are just a bunch of if/then statements using those getters. So, what is the advantage of this rather than just reading rules from the DB?

The advantage is that any part of the generated script can be modified manually for custom targeting and/or bid pricing based on some model. These scripts may be edited by data scientists independently, eliminating the need for engineers to translate data scientists’ models into code! All the data that the model needs will be provided in the arguments to the appropriate methods; just implement the interface in Groovy, and you’re done!

It is important that all data is provided in the input, so the scripts do not need to concern themselves with high-latency fetching of data from somewhere. It is also safe, even if the data scientist makes a mistake. Consider: because we run in JVM, we can take advantage of the Java security model and create our own SandboxSecurityManager to prohibit network calls, harmful calls such as System.exit(), only allow it to call helpers from certain packages, etc.

One caveat, however: the security model is not much help against other harmful things, such as recursion or infinite loops. The idea, however, is to solve those as follows (about which we'll talk in a later post):

  • Recursion: by using static analysis on the loaded code
  • Loops: either by doing the same and prohibiting loops completely, or, if undesirable, by observing which scripts run longer than some time and blacklisting those that exceed the threshold.

Let’s look at some examples:

  • Here is an Ad targeting US and one of the segments, “386:fp:236” or “387:fp:236”.

    Quick note: “fp” here indicates it is a first-party segment; for more information on how these are created and managed securely in a multi-tenant environment, see Nanoput and other articles on our DMP elsewhere on this blog.

  • A creative for this ad is a 300×250 banner. Notice the convention of how the ID of the Ad is included in the name of the Tag. This is for demonstration purposes, and thus a lot of code is unrolled here; in reality, methods like getClickRedir() or getTagTemplate() would just be used from the TagImpl class.
  • Finally, BidPriceCalculator (you can see it is defined in the Ad). As can be seen, it uses a formula based on a segment score and a viewability score from Integral to come up with a price. (This, of course, is just an example.) Notice that the information about the user’s score, which is part of the DS-built model, and Viewability score are part of the augmented request that is the argument to getBidPrice() method. In other words, as noted above, the code here needs to just execute a formula, and all the data will be provided to it; this allows us to sandbox this code for safety while allowing for flexibility.

Pretty cool, if I may say so myself.

P.S. You may feel free to contrast this approach with Xandr’s Bonsai.


Music of the post: Feeling groovy

A post-mortem of a project: Wildboard

Once upon a time, I was working at Snaplogic, and at that time its office was in downtown San Mateo. Pretty much across the street from a great coffee shop, Kaffeehaus.

If you’ve been there, or even if you just looked at the website, you’d realize that the owner put quite an effort into it being a Viennese-style coffee house, with all the interior design decisions that go with it.

Now, a local coffee shop is often a place where people expect to post some local notices and ads ("lost dog", "handyman available", "local church choir concert", etc). And here's a conundrum. A simple cork bulletin board with a bunch of papers pinned to it just did not seem to fit the overall mood/interior/decor of the cafe.

Yet the cafe does want to serve local community and become an institution.

This being Silicon Valley, Val, the Kaffeehaus owner, had a vision: what about a virtual board, as a touch-screen?

The name was quickly chosen to be Wildboard — because it is, well, a bulletin board, and in honor of the boar's head that is prominently featured on the wall.

A multi-touch-based virtual bulletin board sounded interesting. Most touch-screen kiosks I've seen so far — in hotels and malls, for instance, or things like ImageSurge — only allow tap, not true multi-touch. (To be honest, multi-touch may or may not be useful — see below and also the P.S. — but it makes for very nice pizzazz.)

And we — that is, myself and Vio — got to work. And in short order we had:

  • Fully multi-touch (with rotation, zoom, etc) web UI — as a Windows 8 CSS/Javascript app (source).
  • Wildboard “board server” — a Python app running on the same computer as the UI. It is responsible for polling the web server (below) and serving information to the UI (source).
  • Wildboard web server — a PHP app based on an existing web classifieds application (source). This allows users to submit ads (or they can do it via the mobile app, below). It is also modified to automatically create QR codes based on user-provided information (map, contact, calendar, etc.) and add them to an ad.
  • Wildboard mobile app — PhoneGap/Cordova based app for both Android and iPhone (source)
    This app allows one to:

    • Post an ad
    • Scan an ad’s QR code
    • And, finally, for the "Wow!" effect during the demo, one can drag an ad from the screen into the phone. Here it is, in action.

  • Wildboard orchestrator — a Node.js app (source) designed to coordinate interactions between the mobile app and the board. It is the one that determines which mobile app is near which board and orchestrates the fancy "drag" operation shown above.
  • For more information, check out the spec and the writeup.

Charismatic Val somehow managed to get a big touch screen from Elo Touch. It fit the decor quite well.

A network of such bulletin boards, allowing hyper-local advertising, seems like a good idea. Monetization can be done in a number of ways:

  • Charging for additional QR codes — e.g., map, contact, schedule.
  • Custom ad design (including interactive and advanced multimedia features — sound, animation, video).
  • A CPA (cost-per-acquisition) model, while tracking interaction via an app — per saved contact, per scheduled appointment, per phone call.
  • Premium section.

But… alas… This is as far as we got.

P.S. One notable exception is a touch-screen showing suggestions in Whole Foods in Redwood City.

OpenDSP – stay tuned

Every marketer, it seems, wants to participate in real-time bidding (RTB). But what is it that they really want?

They want an ability to price (price, not target!) a particular impression in real-time. Based on the secret-sauce business logic and data science. Fair enough.

But that secret sauce, part of their core competence, is just the tip of the iceberg — and the submerged part is all that is required to keep that tip above water. To wit:

  • Designing, developing, testing and maintaining actual code for the UI for targeting, the bidder for bidding, reporting, and data management
  • Scaling and deploying such code on some infrastructure (own data center, or clouds like AWS, GCE, Azure), etc.
  • Integrating with all exchanges of interest, including the following steps:
    • Code: passing functional tests (understanding the exchange’s requirements for parsing request and sending response)
    • Infrastructure: ensuring the response is being sent to the exchange within the double-digit-millisecond limit
    • Scaling: As above, but under real load (hundreds of thousands of queries per second)
    • Business: Paperwork to ensure seat on the exchange, including credit agreements when necessary
  • Operations: ongoing monitoring of the operations, including technical (increased latency) and business (low fill level, high disapproval level) concerns — whether these concerns are raised by clients or exchange partners or, ideally, addressed pro-actively internally.

None of which is their core competence. We propose to address the underwater part. It’ll be exciting.

Enter OpenDSP. We got something cool coming up here. Stay tuned.

Watch this space — and by this space I mean this blog and this GitHub account.

Of course you have an API!

The following is a dramatization of actual events.

“I need access to these reports.”

“Well, here they are, in the UI.”

“But I need programmatic access.”

“We don’t have an API yet.”

“Fine, I’ll scrape this… Wait… This is Flex. Wait, let me just run Charles… Flex is talking to the back-end using AMF. So what do you mean you don’t have an API? Of course you do — it is AMF. A little PyAMF script will do the trick.”

“Please don’t show it to anyone!”

P. S. That little script was still running (in “stealth production”) months, if not years, later.
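
For the curious, the whole thing was on the order of this sketch (the gateway URL, service and method names are made up; PyAMF's remoting client does the AMF legwork):

from pyamf.remoting.client import RemotingService

gateway = RemotingService("http://example.com/messagebroker/amf")  # sniffed out of Charles
reports = gateway.getService("ReportService")
for row in reports.getMonthlyReport("2008-05"):  # AMF in, plain Python objects out
    print(row)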

Development philosophy

This is one of those posts that will continue to get updated periodically.

I’ve been asked to describe my software development philosophy (and variations thereof) often, so I’ll just keep this here as a list.

  • The right tool for the right job. A “PHP programmer”, to me, is like a “screwdriver plumber” or a “hammer carpenter”. First, figure out the problem you are trying to solve, then, pick the tool.
  • Do not reinvent the wheel.
    • It is likely that others solved a similar problem. There may be solutions out there already, in the form of libraries, SaaS, FOSS or commercial offerings, etc. Those are likely to have gone through extensive testing in real life. Use them. Your case is not unique, nor are you that smart.
    • You are not that smart.
    • Consider buying (borrowing, cloning, licensing) rather than rolling your own
  • Engineering is an art of tradeoffs. Time for space, technical debt for time to market, infrastructure costs for customer acquisition, etc.
  • Abstractions leak.
  • The following things are hard. However, they have been solved and tested and worked for years, if not decades. Learn to use them:
    • Calendars and timezones
    • Character encodings, Unicode, etc.
    • L10n & I18n in general
    • Relational databases
    • Networking (as in OSI)
    • Operating systems

Reporting: you’re doing it wrong

I’ve often said that there are certain things average application programmers just do not get. And those are:

  • Calendars and timezones
  • Character encodings, Unicode, etc.
  • L10n & I18n in general
  • Relational databases
  • Networking (as in OSI)
  • Operating systems

And by “do not get” I do not mean “are not experts in”. I mean, they don’t know what they don’t know. Time and time again I see evidence of this. Recently I saw one that was so bad it was good — and that, I think, necessitates a meta-like amendment to this list, in the spirit of Dunning-Kruger. As in:

You are probably not the first person to have this problem. It is very likely that smarter (yes, snowflake) people already solved this problem in some tool or library that has endured for years, if not decades. USE IT!

Here is the incident, worthy of The Daily WTF.

There is a monthly report that business runs (it doesn't really matter what kind of report — some numbers and dollars and stuff). How is the report being generated? (Leaving aside for now the question of why one would roll one's own reports rather than using a ready-made BI/reporting solution.) Using the brilliant algorithm that Donald Knuth forever regrets not including in his TAOCP series:

  1. Set current_time to the chosen start date, at midnight.
  2. Output the relevant data from day specified by current_time.
  3. If current_time is greater than the chosen end date, exit.
  4. Increment current_time by 86400 (because 86400 is a universal constant) and go to step 2.

What could possibly go wrong?

Nothing. Except when you hit the “fall back” time (end of DST). Once you go past that date, the system will subtract an hour, and you end up at 11pm the previous day, not midnight of the day you wanted. And because you’re stupid, you have no idea why all your business users are waiting for those reports forever.
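
The failure is easy to reproduce (Python 3.9+ with zoneinfo; any DST-observing timezone will do):

from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# "A day is 86400 seconds": start at midnight on a fall-back day...
t = datetime(2023, 11, 5, 0, 0, tzinfo=tz)
print(datetime.fromtimestamp(t.timestamp() + 86400, tz))
# 2023-11-05 23:00:00-05:00 -- 11pm, and not the midnight you wanted

# Iterate over calendar days, not seconds:
print(date(2023, 11, 5) + timedelta(days=1))
# 2023-11-06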

Java-to-Python converter

"Anything that can be done, could be done 'meta'" (© Charles Simonyi) is right up there with "Laziness, impatience and hubris" (© Larry Wall) as a pithy description of my development philosophy. Also, unfortunately, there's another one: "Once it's clear how to proceed, why bother to proceed" (or something like that). So, with that in mind…

I wanted a Python client library for GData (thankfully, they released one last week, so this is moot — good!), so I thought of automagically converting the Java library to Python. I tried Java2Python, but it's based on the ANTLR grammar for Java 1.4, and the library, of course, is in Java 5. As I was relearning ANTLR and writing all these actions by hand (the pain!), I took a break and found a Java 1.5 parser with AST generation and visitor support by Julio Gesser (no relation, I presume?) and Sreenivasa Viswanadha, based on JavaCC. Aha! Much easier… But then, of course, Google releases the Python version of the library I needed in the first place, so I don't bother wrapping this project up… Here it is for whoever wants it: http://code.google.com/p/j2p/.

Poor Man’s Tracepoints and a call sequence of a C program

My C is quite rusty, so to help me figure out the flow of a program, I thought I'd do with gdb what Tony Loton did with JPDA. Quickly giving myself a refresher on gdb, I thought tracepoints were the easiest way to go. Except that they are available only for remote targets, and

  1. There's no gdbserver on my host platform (Cygwin)
  2. The target platform does have it (good news!), but it doesn't support tracepoints (and some say few, if any, stubs even support them)

So I wrote a silly Perl script to read ctags information, create breakpoints on every function entry, print the arguments and resume. I haven’t bothered to figure out where to use pure MI vs. CLI commands, and in general I have no clue…
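
Roughly, the idea, redone here in Python rather than the original Perl (this assumes the extended ctags file format, where a kind field of "f" marks a function):

def ctags_to_gdb(tags_path="tags", out_path="trace.gdb"):
    # Emit a gdb command file: break on every function, print args, resume
    with open(tags_path) as tags, open(out_path, "w") as out:
        for line in tags:
            if line.startswith("!"):  # skip ctags metadata headers
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 3 and fields[3].startswith("f"):
                out.write("break %s\ncommands\nsilent\ninfo args\ncontinue\nend\n"
                          % fields[0])

ctags_to_gdb()
# then: gdb -x trace.gdb ./a.out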

Now, what I think would be interesting is making this an add-on, using CDT (or, more generically, via Eclipse Debug Framework) and GEF… I envision something like a call graph (with exclusions of course, because it will become too big), which grows as you step through, displaying arguments. Could be a quick way to get a picture of how a program works before just reading the code and keeping stuff in your head…


In related news, upgrading to Eclipse 3.2.2 I lost the "Remote debugging" launch configuration. (Ironically, the reason for the attempted upgrade was to see whether a bug with a hardcoded remote port had been fixed: it appeared that it's always 4305, no matter what you put in, while the default one is 1234, which is the kind of thing an idiot would have on his luggage.) Which brings me to the "zero information content" of various blogs/articles out there (which this blog is trying not to be). Thank you, Bill Graham, for pointing out that "the 'vanilla' CDT from http://www.eclipse.org […] doesn't support remote target debugging", thus leading me in my Google search to your article, which doesn't tell me how to fix it, nor does it tell me much of anything… There, I vented. (The answer, BTW, is to get Target Management, who knew…)

Being a drop-out was fun for a while…

But all good things must come to an end. 11 years after the scheduled time, I am now a Bachelor of Science in Computer Science and Engineering from MIT:

mit-diploma.png

Thanks be to:

Continuous integration with code coverage in Python

Update

This code has been integrated into the main tree.

I decided it would be good to have a coverage report of our Python code, with nice visualization like Clover.

So I took Ron Smith's PyAntTasks and added a py-cover task to them. This will run coverage for every test, and a cumulative one. In other words, you can see what code a particular test exercises, and what code all the tests in your tree exercise.

This also modifies the py-test task to include a packagedtests attribute – see below.

The newly added py-cover task runs Ned Batchelder's coverage.py (download it separately), and is specified as follows in your build.xml; the snippet below is a reconstruction, so the exact element spellings may differ:

<py-cover reportsDir="coverage-reports"
          packagedtests="false"
          coverage="/path/to/coverage.py">
  <fileset dir="test" includes="**/test_*.py"/>
  <fileset dir="src" includes="**/*.py"/>
</py-cover>

Here, the first <fileset> is a FileSet specifying the tests to run, and the second is a FileSet specifying the source code to cover.
The attributes are:

  • reportsDir – where the coverage reports go
  • packagedtests – this idiosyncrasy is prompted by our tree setup. If this attribute is true, it means that the test files reside in Python packages; false otherwise. (In our case, they do not; they are in the tree but are not packages. Note that the original py-test task assumed they are in packages; I have changed this too.)
  • coverage – path to coverage.py on your system (which you downloaded separately, right?)

P.S. I know about the colorize.py thingie, but I rolled my own (uglier, of course) for this one.

XmlRpc for AS 3.0 HOWTO

Since people ask (in emails, at the short-lived MT version of this blog and on flexcoders@yahoo), here is how I use it.

DISCLAIMER: as I've mentioned, I hacked this from the original when I was just starting with Flex 2, so I gutted some code before I knew what the hell I was doing, but in this state it works for me…

I use Flex Builder. I have the contents of the zip file right in my Flex Builder project. (NOTE: this was before I read that "flex" should not be a part of a package name per Adobe's license; change it, willya?)

Then I have a simple facade, like so:

package com.qbf.flex.ct {
  // Insert correct imports here, of course...

  public class Utils {
    public static function serverCall(method:String,
                                      params:Array,
                                      handler:Function):void {
      SERVER_CALL.call(method,
                       params,
                       handler,
                       DisplayObject(Application.application));
    }

    public static const SERVER_CALL:XmlRpcService =
      new XmlRpcService("http://localhost:8081/",
                        Application.application.url.indexOf("http://") == 0 ?
                          null :
                          // This is explained further below...
                          QbfFakeResponse.SINGLETON);
  }
}

And from everywhere else in the code I use Utils.serverCall() as follows:

var params:Array = [{value: resdefDataToServer, type: "struct"}, uri, server];
Utils.serverCall("doStuff", params,
    // This callback will be executed when the server call returns
    function(result:Object):void {
        process(result);
    });

Notice that you only need to specify types for parameters if they are not strings. If all your parameters are strings, you can just pass an array of their values to serverCall() – a little sugar…

Now, about this QbfFakeResponse.SINGLETON business (it's not necessary; just use null in its place if you don't need it). The definition of SERVER_CALL checks whether the application URL starts with http:// — in my case, that means it's running "for real". If it is running in Flex Builder's debugger, the URL will start with file://, and so I will use "fake" responses to simulate server calls (this is sometimes useful for debugging, so that I don't have to run a server). This fake response class is an implementation of the com.qbf.flex.util.xmlrpc.FakeResponse interface. All you need to do is implement the getResponse() method and return whatever you need by checking the method name and parameters provided.

The King, the Jedi and the Prodigal Son walk into a bar…

So, earlier I tried to switch to Blogger briefly, because my LiveJournal was messing up javablogs feeds (and I wanted something trackback-like).

But then I missed this tag/label/category functionality thingie, so I had a brief affair with Movable Type, but then, voila — The New Version of Blogger. Good, I don’t have to host the stupid thing then…


Peter Kriens has been working too much: "Today an interesting project proposal drew my attention: Corona. Ok, the name is a bad start. The Apache model of names without a cause is becoming a trend." Eh? I was with you until the last sentence — but it's not an Apache model of names without a cause, it's a model of — aw, geez, there must be a pithier term for it — names for things associated with a main product that are in some way puns on the original name (JavaBeans, Jakarta, etc.). Get it? Sun – Eclipse, Eclipse – Corona? (Things will really get out of hand — with horses! — when a Corona-associated product is called Dos Equis.)

Silly Flex trick of the day

Since "only top-level components in the application can have context menus", how would one make a different ContextMenu for every type of item in a tree? In particular, I want tree leaves to have different menu items enabled than the branches; e.g., the leaves should have the "Properties" menu enabled and the "Create new" menu disabled, and vice versa for the branches. Mac Martine mentions using rollovers, which is cute, but it looks like a more robust way is to use MOUSE_OVER (yeah, I tried both).

To do this, use the following TreeItemRenderer in the tree in question (obviously, you can choose to add/remove things from ContextMenu.customItems rather than enabling/disabling them…):


public class ServerTreeRenderer extends TreeItemRenderer {

    override protected function updateDisplayList(unscaledWidth:Number, unscaledHeight:Number):void {
        super.updateDisplayList(unscaledWidth, unscaledHeight);
        if (super.data) {
            if (TreeListData(super.listData)) {
                var leaf:Boolean = !TreeListData(super.listData).hasChildren;
                super.label.addEventListener(
                    MouseEvent.MOUSE_OVER,
                    function(evt:MouseEvent):void {
                        // This will disable all items in my context menu (it's declared
                        // somewhere else; this is not a TreeItemRenderer method, duh...)
                        disableAll();
                        if (leaf) {
                            // propertiesItem is a ContextMenuItem...
                            propertiesItem.enabled = true;
                        } else {
                            contextMenuContainer.createNewItem.enabled = true;
                        }
                    });
[...]

All of this is in anticipation of a promised exegesis on how a custom ContextMenu works anyway (I have no time for this at the moment), but it works for now…

P.S. I just like the word “exegesis”. The more exegeses, the merrier…

Dbdb – a JPDA-based single-stack debugger for mixed-language programming

Dbdb project is officially up for adoption, because I have no plans for working on it (I am sick of it).

Dbdb is a proof-of-concept of a JPDA-based single-stack debugger for mixed-language programming, done as an Eclipse plugin (but it doesn't have to be). It is based on Java 6 ("Mustang"). The proof-of-concept allows a developer to debug Java code that calls a PL/SQL stored procedure. The debugging session in Java proceeds normally, nothing to write home about. When a Statement.execute() (or similar) statement is executed, however, the debugger connects to Oracle's VM and shows a combined call stack, from Java down into PL/SQL. (See screenshot.) The idea, of course, is that it can be done with other combinations, but Java-into-Oracle-stored-proc is a very common scenario.

P.S. This is a rehash of an older post. I am trying to see what Blogger is like vs. LJ (for instance, LJ breaks javablogs feeds).

That’s it, done…

That’s it, done!

Bassem (Max) Jamaleddine

 
Prof. Madden finally approved the latest version of the Dbdb write-up, and so I am all set for my 10+-years-overdue degree. With that, I've updated the SourceForge project with all the latest stuff from my workspace, including the docs on the page, Javadoc, code (and the aforementioned docs) in CVS, etc. (even a screenshot). The Dbdb project is officially open for adoption, because I have no plans for working on it (I am sick of it). Fly, baby, fly…
P.S.

  • I have to see whether Pat and Spencer actually decided to use this one for the IDEA Plugin Contest… There’s still time…
  • Maybe I do want to augment it for use with GWT, so it automagically inserts a debugger; statement as the first line of any native Javascript method… Just for kicks… Nah, it would be too slow…

It’s Friday…

 

Evaluating expressions in PyDev (Eclipse plug-in for Python)

I use PyDev because, probably like many, I am used to Eclipse for Java development. What I found useful there is highlighting a snippet (expression) in a debug session and pressing Ctrl+Shift+D to evaluate it, and I miss this in PyDev. A crude workaround is to add the expression to the Watch list, but that grows the Watch list and is not convenient: I not only have to right-click, choose Watch, and then look in the Watch list, but may also need to scroll that list, remove things, etc. That's not what I am used to. So I threw together a crude implementation of it.

The change is in the org.python.pydev.debug project:

  1. Added the EvalExpressionAction class to the org.python.pydev.debug.ui.actions package.
  2. Changed the plugin.xml.
  3. The MANIFEST.MF thus includes two additional bundles in the Require-Bundle: field: org.eclipse.core.expressions and org.eclipse.jdt.debug.ui. (Well, the second one is only for the second keystroke – "persisting" the value in the Display view – and only because I was lazy at this point. But also, since this thing relies on other org.eclipse.jdt stuff anyway, I figured it's not a big deal.)

    Another problem here is that I couldn't figure out how to make Ctrl+Shift+D do the persisting on a second press; so Ctrl+Shift+D displays the value in a popup, and Ctrl+Shift+S does the persisting. (The choice of "S" is that when I press Ctrl+Shift+D my index finger is on D, so it's easy and fast to use the middle finger to press S immediately :). That is still close to what I am used to pressing blindly. People get used to all sorts of weird keystrokes and go out of their way to reproduce them in their new environment; just witness viPlugin for Eclipse.

Of course, as I went to announce this on the list, I saw that PyDev already has a slightly different mechanism for that. Oh well; at least this way still saves me some keystrokes, and I learned that the Console view is also a Python shell. (That's 'cause I never RTFM…) But at least I was not the only one.

So anyway, this seems to work in my environment; just unzip into the Eclipse folder – and do so at your own risk…

How to waste a weekend

I spent half of Friday tracking down a GWT problem that caused issues in both Opera and IE. By trial and error I kind of isolated it to my use of syntactic sugar (yeah, I know…). Then I spent most of Saturday trying to come up with a pithy reproducible case. Here it is (posted to http://groups.google.com/group/Google-Web-Toolkit/browse_thread/thread/ae307a25041f3a7e/, reproduced here for my own reference):

Consider the following source (made with -style pretty):

package x.y.z;

import com.google.gwt.core.client.*;
import com.google.gwt.user.client.ui.*;

public class WastedWeekend extends Label implements EntryPoint {
  public void onModuleLoad() {
    foo();
  }

  void foo() {
    Button[] stuff = new Button[1];
    doNothing(stuff[0] = new Button());
  }

  void doNothing(Widget w) {
    RootPanel.get().add(w);
  }
}

The foo() function translates into the following:


function _$foo(_this$static){
var _stuff;
_stuff = _initDims('[Lcom.google.gwt.user.client.ui.Button;', [0],
[5], [1], null);
_$doNothing(_this$static, _stuff[0] = _$Button(new _Button()));

}

This is all well and good. But now let me change, in foo(), the type of "stuff" to Label (that is, the superclass of WastedWeekend), as follows:


void foo() {
Label[] stuff = new Label[1];
doNothing(stuff[0] = new Label());

}

Now the translation is:


function _$foo(_this$static){
var _stuff;
_stuff = _initDims('[Lcom.google.gwt.user.client.ui.Label;', [0],
[7], [1], null);
_$doNothing(_this$static, _setCheck(_stuff, 0, _$Label(new _Label())));

}

Notice the difference? What’s going on?

This is reproducible whenever this sample WastedWeekend class is a subclass of the same Widget as the stuff variable. (I did not go into more detail as far as the inheritance tree, so I don't know what happens if they are siblings or something)…

Now, should I spend Sunday trying to actually figure this out?

Just say no to Holub


Boo-hoo! You had me, and then you lost me!

Frank Sinatra

What does the pigeon (голубь) have to do with it?

A Report from the First Spring Olympic Games

Yeah, yeah, we do want to “Just say ‘No’ to XML“. Amen.
And +1 to Mr. Holub for noting that "…many so-called programmers just don't know how to build a compiler. I really don't have much patience for this sort of thing." But it's all downhill from there:

  • -0.1 for describing Ant as a “scripting language” (it really is declarative…)

  • -0.4 for picking on Ant, of all things, in the first place. Some people can write a compiler and still manage to subject "every one of [their] users to many hours of needless grappling with", oh, I don't know… make???

  • -0.5 for plugging his book at the end

  • -10 for doing the above with an innocent "By the way". (+10 if this "innocence" is tongue-in-cheek, Lt. Columbo-"Oh, and just one more thing"-like. But "architects, consultants and instructors in C/C++, Java and OO design" don't do this kind of subtlety.)

In all, Mr. Holub is 10 in the hole for this round…

A classic case of how a perfectly defensible thesis is ruined by the examples…

More WIBNIs

 


P.S. And on the lighter side…

KISS

Apropos that meme/urban legend/theme/tall tale about the space pen and a pencil…

I'm using Alexei Sokolov's version, not the SVG one, but close enough… So, just like commenter Stefan, I was wondering about using text with GWT Canvas. So when Robert Hanson responded that "that is definitely on my to-do list, but I don't have a timeframe for doing it", I envisioned some weird stuff like sending vector font definitions and rendering them, or some such thing…

And then I think, PopupPanels will do… Don’t forget to override onEventPreview() like so:

public boolean onEventPreview(Event event) { return true; }

Tips

  • Cygwin productivity tips
  • It is possible to do ASP/JavaScript debugging without using Visual InterDev by using the IIS script debugger
  • If you ever see an error like “Java Plug-in for Netscape Navigator Should Not Be Used with Microsoft Microsoft Internet Explorer. Please Use Java Plug-in for Microsoft Internet Explorer”, see http://home.att.net/~cherokee67/ffjavaerror.html
  • Creating a Windows EXE from Java (don't ask): I found NativeJ the best for my task. I also tried JexePack, but it didn't seem to do well with the IBM JVM (it says that the unregistered version will just show a modal dialog box on startup, and it did on my machine, but when I tried running it on the WebSphere box, it seemed to create an endless loop of those modal dialogs). Finally, Excelsior JET is good, but too complex for a little task like this — creating an executable involves too many steps, skipping over things I don't need. The interface is not the best either — too much like a wizard, too little like the IDE that it really is. But maybe it's good for bigger, more involved projects.

GMail WIBNIs

So I noticed that when I got an e-mail about an appointment, GMail helpfully (no, I mean it!) included a conspicuous link for entering this appointment into my Google calendar. Which leads me to a couple of WIBNIs:

  1. When I get a bounce, I should get a similar link allowing me to remove this address from my contact list. (Parse the email, come on, I know you already do, so it’s not that big an invasion of my non-existent privacy to see that this email came from a MAILER-DAEMON or something)…
  2. More or less ditto for locations mentioned in emails.
  3. When I do “Report Spam”, I don’t really give a flying spaghetti monster what the underlying algorithm is, but is it too much to expect never to see a message from that particular address in my inbox?
  4. In general, perhaps there’s a way to allow people to create solutions for similar WIBNIs, immediately adding this functionality to their own account and also contributing them to some central repository of solutions, thus enhancing Google’s hegemony further, if that’s even possible.
  5. I’ll be having more to say…

P.S. A couple of days after discussing with BOBHYTAPb the silliness of Google's attitude toward "mail sent to yourself will not appear in your inbox as you expect, because it's a feature and you're gonna like it, and we don't give a shit that that's what you expect, 'cause your expectations are due to bad upbringing", I noticed that this changed.

IOException: OutputStreamOfConsciousness is not accepting any more output

Given my “penchant” for using character names from French adventure narratives, I have decided to give Dbdb project the code name “Bragelonne” (the link is for… you know…). It is, after all, ten years since, you know… Which is all the more fitting (ironically) as I am about to give it up for adoption…

WIBNI…

 

    1. A “Debugging Eliza” idea from BOBHYTAPb

      Here is a bomb of an idea: A Debugging Eliza.

After a long and fruitless session of debugging, a programmer reaches a certain dead end, where he has already glossed over the problem, has not found it, but noted subconsciously that that avenue has been checked — or simply didn't think about it. At this point he needs to talk to someone about this problem — a sort of psychiatrist, who doesn't even have to be human: it could be a slightly tweaked version of Eliza.

      “What are you doing?”

      “I am debugging Blah-Blah Industrial Application.”

      “What was the last thing you tried?”

      “I checked that the configuration file corresponds to the Blah-Blah…”

      “And how is the Blah-Blah?”

      “It’s perfectly fine.”

      “And how does that make you feel?”

      “It means the problem is somewhere else.”

      “Where else could it be?”

      etc.

Obviously, this is where a lot of problems are found — when you are asking someone for help and, in the process of explaining the problem, realize your error.

  • Eclipse, for all its cool pluggable architecture, lacks a basic thing — macros — which should be easy given said architecture. That is, a way to record (or write by hand, fine) a series of steps to instruct the Eclipse workbench to do something, and then play it back. Where's AppleScript when you need it?

    For example, instead of creating a walkthrough, you could just hand people a recorded macro. Yes, part of the pain in this particular case can be solved by, for example, checking the dot-files into source control and then telling everyone to "Import existing projects into a workspace" after checking out the tree. But I can't do that — there are dot-files of a "competing" approach checked into the repository, which suit some of us fine but lack the things others want. But that's just this particular example; I cannot come up with another case right now, but trust me, such cases exist.

YODL

Once upon a time, BOBHYTAPb, Shmumer, others and yours truly thought that a short-term LARP-like online game could be interesting. (Nothing came of it, of course.) One of the problems cited at the time was that computer games were lacking in modeling of reality in general (duh!). In particular, the thought went, the problem is with OOP itself. So YODL was conceived. (Did I mention that nothing came of it?) Shortly thereafter I discovered Subject-Oriented Programming articles… A while later, I found notes about YODL, which I reproduce here in their incoherent entirety, without any hope that anyone cares, using this LJ as my personal repository of stuff to refer to, maybe…


interface to other language/objects/functions?

1. Interceptable actions

The most limiting feature of this scheme is the finality of all actions. In MUDs, one active object can intercept another object's action and veto it. Here, once an action is initiated, it is performed (see the caveat below). Other objects can only react to it later on. Example: in a MUD, you can place a closed chest and a guardian over it. If you try to open the chest, the guardian stops you — you have to kill him first. In this scheme, if you are close enough to the chest to open it, you open it, guardian or not.

2. Environment as a privileged interceptor

Broadcasting messages to those that are interested and are eligible (cf. 1?)

3. yodl abstract

YODL is a mark-up language used to rapidly create new game worlds by placing objects in them. Objects can be existing ones, taken from the library, as well as newly created (with YODL as well!) on the basis of other objects. YODL supports inheritance, with every object inheriting its properties at least from some ideal object(s) (instances of which cannot be created), or from other functional objects. As is standard for such inheritance, properties can be overridden, added, erased.

An object can be created by inheriting from two objects — thus compound objects (e.g., a rifle with laser targeting can be created out of a stock rifle and a stock laser pointer).

As much as this seems like a use of the standard OO (object-oriented) approach, YODL presents an important innovation over the traditional OO approach:

We strive to make the worlds we create believable. To do that, ideally, the user must be able to do with a given object whatever he can do to it in the real world. That is impossible under the standard OO paradigm.

In the OO paradigm, the designer must specify the behavior that each object is capable of. If a certain behavior is not specified, the object cannot perform it. This is a great disadvantage. It is impossible to think of all the things it is possible to do with, for example, a cup. What if a user would like to try to hammer nails with it?

YODL provides for that and other behavior by NOT providing specifically for a behavior in an object. Instead, YODL allows designers to specify a set of actions generally available (hitting, throwing, heating up an object…). Then the object acted upon executes that action upon itself, and the action, based on the object's properties, decides on the consequences. For example, consider a metal cup and a ceramic one. The designer did not specify whether either cup can be hit. However, an action of being hit is in the system, and if a cup is hit, then, based on its properties, the action will decide if the cup breaks (ceramic) or bends (metal).

In contrast to OO, this could be termed AO — an action-oriented paradigm. This is a misnomer, however, since YODL does not give preference to actions (verbs) over objects (nouns). Not getting into linguistic debates: if we need both to better describe our world, we will have both.

Other concepts introduced in YODL are related. To go into details, we would need to provide the full YODL specification, which we can't right now. Of immediate interest, however, are also the following concepts.

  • Action inheritance – to ease the work of designers, actions can be inherited just as objects can be.
  • Faces – actions that can be inflicted upon the object can be calculated automatically, some being discarded (e.g., if an action of "break" cannot possibly be inflicted upon an object, it is discarded).

Remaining actions represent a "face" of an object. This is useful to the user, who can then be presented with a list of actions he can do upon the object, as a menu. More importantly, however, this can provide differentiating "cognitive portraits" of users' characters, forcing the character to see an object in some way. For example, a character that has never heard of a gun will be able to "press the trigger" but not to "shoot" the gun — and definitely not to load it.

Context-aware programming, in that respect like AOP.

An ACTOR is a performer of actions. An OBJECT is something the actions are performed on.

An OBJECT IMPLEMENTS a collection of INTERFACES.

An INTERFACE REQUIRES PROPERTY REFERENCES. When implementing the INTERFACE, the OBJECT must PROVIDE those to the INTERFACE. If the PROPERTY is not set by the OBJECT, it may be deemed UNKNOWN. The INTERFACE must specifically allow a PROPERTY to have an unknown state.

Properties have a state of Unknown. If an Object's Interface Requires a PR for the Object, and the state of the Property is currently Unknown, the Object never returns that Interface as a member of the Face. (Or what of optional ones?)

A subset of the collection of INTERFACES that an OBJECT implements is a FACE of the OBJECT. An OBJECT thus has many FACES. (set of methods???)

When the ACTOR interacts with the OBJECT, the ACTOR constructs an INSTANCE of his PERCEPTION INTERFACE (parameterized, among other things, by the LOCAL_ENVIRONMENT) and passes it to the OBJECT. Based on the received INSTANCE of the PERCEPTION interface, the OBJECT returns to the Actor a Face, that is, a subset of its collection of Interfaces. What the Actor sees, then, is a particular Face of the Object, parameterized by the current Actor's Perception and the current state of the Local_Environment.

(Addendum: The Object doesn't choose shit. It just passes the PERCEPTION to the Interfaces, and they decide which Subinterfaces make up the Face.)

TRIGGERS

A Role is a collection of Interfaces (CF. Role theory – GG2001).

Multiple Inheritance Resolution: user intervention. If an Actor invokes a Method appearing in more than one Interface, the Actor is asked to specify which Interface he had in mind. Or the designer provides a default order?

Deconstruction Interface:

BASIC Interface

A tool for script creation — just record the proceedings

Environment Triggers — implicit, e.g., an air balloon’s trigger on local pressure.

PROPERTIES — not just numbers, but instead objects with access methods!

Properties implying interfaces requiring them? — Properties only imply Passive; ACTIVE require knowledge and must be explicitly declared.

Every Interface comes in two Complementary types — Active and Passive. The Passive interface contains handlers for the Active interface.

The Actor passes the collection of his Active interfaces to the Object (with the Perception module). The Object returns a collection of the Actor's subinterfaces, corresponding to what the Object can handle given the current state of the Environment and of the Actor. This could involve a Fallback of some of the Actor's active interfaces to their ancestors, as best as the Object can handle for that particular interface.

Environment and Actor are special cases of Objects. They also provide Faces to the Actor. The Actor's Face includes Inventory, for example.

Case Study: Actor contains Active "Run" and Passive "Run". P-RUN changes the actor's coordinates.

What goes into the Environment? Are walls treated as Objects? Is the Environment any different from a specialized collection of Objects we don't want to treat as Objects? Philosophically?

A Face includes all the Representation stuff — graphics, audio, smells, whatever. In fact, these should depend upon the Actor's Perception and not only on the state of the Object.

PH: An Interface can be separated into Effect and Implementation.

Flying is an effect. Winged Flying, Propeller Flying, Jet Flying – implementations of that effect. This mimics Templates — but not completely.

Properties imply Passive interfaces — Passive Interfaces are written using fixed property names. Ergo, Interface designers must communicate heavily.


 Keyword unknown

 Actions {

     Hit (subject, object) where
         Object has <...>, is <...>
         Subject has <...>
     {
         ...
     }
 }

 Actor me {
     Knows hit
 }

 class chair implements matter {
     state = solid;
     // weight not here - automatically unknown
 }

 class Neanderthal knows wood {
 }

 class Bird knows wood

 interface matter {
     property state : {solid, liquid, gas};
     optional property weight;
 }

 interface wood implements matter {
     property hardness : 3;
 }

 class blade implements iron {
     property edge : .1;

     cut (matter m) {
         if (m.state == solid) {
             if (...) ...
         }
     }
 }

 Neanderthal N;
 Bird b;
 Wood w;
 Knife k;
 b.use(k.cut(w));
 

Some random links jotted in these notes:

RSS WIBNI

So, I broke down and got a paid account just so I could syndicate (oh, and …). Does this even work? We'll see…

So, while I am at it, here's an RSS WIBNI: a weighted RSS, so that, for example, occasional entries from a favorite low-volume feed stay on top, rather than being beaten by frequent spewage from something like /. (I won't even link to that den of iniquity, but I read it for the articles…)

Rant

The debate (read: holy war) on the topic of software engineering vs. "real" engineering seems as endless as the GWOT. I am too lazy to do an extensive search, but I do remember one of the pithy definitions claiming the use of differential equations as a necessary condition…

DISCLAIMER/DIGRESSION: I don't really care, but "engineer" does sound cooler than "programmer", which doesn't have a sci-fi ring to it anymore, or "developer", 'cause Donald Trump is also one — not that he isn't cool.

But I thought I'd throw just one more difference into the mix. Software engineers — at least those who work in application development — have to use knowledge of other domains: those for which the software is written (e.g., finance, etc.).

As far as I am concerned, these domains tend to be boring… I like technology for technology’s sake… Does that make me more of an engineer?

Discuss

I wonder whether Michael Swaine weighed/will weigh in on it…

P.S. Please…

configureWebserverDefinition.jacl

Per the Web server plug-in installation doc (step 19), we should copy the configureweb_server_name.bat script, generated by the plugin installer, to the WAS_HOME/bin directory and run it in order to map the installed applications to the Web server. This was giving us problems on Windows 2003, such as this:

Configuration save is not complete, exception = com.ibm.ws.scripting.ScriptingException:
com.ibm.websphere.management.exception.ConfigServiceException:
WKSP0008E RepositoryException while checking the state of
cells/mycell/applications/application.ear/deployments/application/foo.jar/META-INF/ibm-ejb-jar-ext.xmi
in the master repository - com.ibm.ws.sm.workspace.WorkSpaceException:
WKSP0016E Error get digest for cells/mycell/applications/application.ear/deployments/application/foo.jar/META-INF/ibm-ejb-jar-ext.xmi.workspace_save - java.io.IOException: The system cannot find the specified file, either the filename is too long on Windows system or run out of file descriptor on UNIX platform. java.io.FileNotFoundException: D:\opt\WebSphere\AppServer\profiles\AppSrv01\wstemp\Script10afa2477bf\workspace\cells\mycell\applications\application.ear\deployments\application\foo.jar\META-INF\ibm-ejb-jar-ext.xmi.workspace_save (The handle is invalid.)

WASX7309W: No "save" was performed before the script "D:\opt\WebSphere\AppServer\bin\configureWebserverDefinition.jacl" exited; configuration changes will not be saved.

What is happening is twofold.

First, configureWebserverDefinition.jacl does a save only once, at the end, and that save fails for all installed applications. To narrow the problem down, I put $AdminConfig reset after foreach Application $ApplicationList {, and moved the

 if {[catch {$AdminConfig save} result]} {
 puts "Configuration save is not complete, exception = $result"
 } else {
 puts "Configuration save is complete."
 }
 

block from the end of the script into the else clause of

 if {[catch {$AdminApp edit $Application [subst {-MapModulesToServers {$targetMapList } } ]} result]} {
 puts "Target mapping is not updated for the application $Application, exception = $result"
 } else {
 puts "Target mapping is updated for the application $Application"
 ...
 

I think saving this per-application is better, at least for narrowing the problematic apps down, but I suppose it’s a matter of preference…

Now we see that all applications except two are being updated. Without digging into what exactly is special about those two, it's enough to go to "Map modules to servers" in the WAS console for any other app to see that the JACL script mapped all modules to the web server — including the EJB modules. Not only is that silly and useless, but it is also what causes the "filename is too long" thing. So we add another condition to configureWebserverDefinition.jacl to map only Web modules to the web server, by changing the block where we set targetMapList as follows:

 #--------------------------------------------------------------
 # Check if web server is already defined as the target
 #--------------------------------------------------------------
 set index [string first $newTarget $currentTargets]

 if {($index < 0)} {
     set isWebModule [string first ".war,WEB-INF/web.xml" $moduleUri]
     if {($isWebModule < 0)} {
         puts "$moduleUri is not a Web module, skipping..."
         set targetMapList "$targetMapList { {$moduleName} $moduleUri $currentTargets }"
     } else {
         puts "Will map $moduleUri to $newTarget"
         set targetMapList "$targetMapList { {$moduleName} $moduleUri $currentTargets+$newTarget }"
     }
 } else {
     set targetMapList "$targetMapList { {$moduleName} $moduleUri $currentTargets }"
 }

The configureWebserverDefinition.jacl modified as above is in the Misc module of the jSewer project on SourceForge.

Some notes

  1. One can argue that a step from one VM into another should not, from the debugger's point of view, be a step within one thread; rather, a new thread should be opened. But this involves too much interfacing with the debugger (thinking about how to tell it that it needs to register a ThreadStartEvent to open a new thread when it didn't request it, e.g.), so I scrapped the idea, but feel it worthy of noting.
  2. %#@#$%@#$%@^@@^%

    java.lang.ClassCastException: org.hrum.dbdb.DelegatingThreadReference cannot be cast to com.sun.tools.jdi.ThreadReferenceImpl

    Did I mention how this pisses me off?

  3. Ok, even though Dbdb needs Java 6, the proof-of-concept does not support the new JDBC driver loading… Just a note to self…

ReflectiveVisitor

Inigo Surguy presents an implementation of the Visitor pattern based on reflection. I found useful a slight modification of it that allows for visiting not only the narrowest type but every type possible, based on a visitAll flag: ReflectiveVisitor.java. This functionality comes in quite handy for tasks such as validation. For example, if one has a complex class hierarchy (especially with multiple inheritance) where different types have different requirements for what makes an instance valid, and a concrete instance may implement a number of those, these validations can be cleanly separated out.
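
The gist of that modification, sketched in Python (the original is Java; names here are illustrative):

class ReflectiveVisitor:
    def __init__(self, visit_all=False):
        self.visit_all = visit_all

    def visit(self, obj):
        # Walk the type hierarchy, most specific type first; call every
        # matching visit_<Type> method if visit_all is set, otherwise
        # only the narrowest match
        for cls in type(obj).__mro__:
            method = getattr(self, "visit_" + cls.__name__, None)
            if method:
                method(obj)
                if not self.visit_all:
                    break

# A validator subclass would define visit_Base, visit_Derived, etc.;
# with visit_all=True, an instance of Derived triggers both checks.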

See also:

  1. Visiting Collection elements
  2. Depth-first polymorphism

What’s going on here?

Consider the following code:

MethodEntryEvent evt;
ObjectReference con;
...

Class evtClass = evt.getClass();
System.out.println("Class of evt: " + evtClass);
System.out.println("Methods of evt: " +
                Arrays.asList(evtClass.getMethods()));
try {
  Value v = evt.returnValue();
  System.out.println(v);
} catch (Throwable ex) {
  ex.printStackTrace();
}

try {
  java.lang.reflect.Method retvalMethod =
          evtClass.getMethod("returnValue", null);
  retvalMethod.setAccessible(true);
  con = (ObjectReference)retvalMethod.invoke(evt, (Object[])null);
} catch (Throwable t) {
  t.printStackTrace();
}
System.out.println("Returned: " + con);

When running, this code prints the following:

 Class of evt: class com.sun.tools.jdi.EventSetImpl$MethodExitEventImpl
 Methods of evt: [public com.sun.jdi.Value com.sun.tools.jdi.EventSetImpl$MethodExitEventImpl.returnValue(),
  public java.lang.String com.sun.tools.jdi.EventSetImpl$LocatableEventImpl.toString(),
  public com.sun.jdi.Method com.sun.tools.jdi.EventSetImpl$LocatableEventImpl.method(),
  public com.sun.jdi.Location com.sun.tools.jdi.EventSetImpl$LocatableEventImpl.location(),
  public com.sun.jdi.ThreadReference com.sun.tools.jdi.EventSetImpl$ThreadedEventImpl.thread(),
  public int com.sun.tools.jdi.EventSetImpl$EventImpl.hashCode(),
  public boolean com.sun.tools.jdi.EventSetImpl$EventImpl.equals(java.lang.Object),
  public com.sun.jdi.request.EventRequest com.sun.tools.jdi.EventSetImpl$EventImpl.request(),
  public com.sun.jdi.VirtualMachine com.sun.tools.jdi.MirrorImpl.virtualMachine(),
  public final native java.lang.Class java.lang.Object.getClass(),
  public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException,
  public final void java.lang.Object.wait() throws java.lang.InterruptedException,
  public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException,
  public final native void java.lang.Object.notify(),
  public final native void java.lang.Object.notifyAll()]
 java.lang.NoSuchMethodError: com.sun.jdi.event.MethodExitEvent.returnValue()Lcom/sun/jdi/Value;
         at org.hrum.dbdb.DriverManagerMethodExitEventListener.process(DriverManagerMethodExitEventListener.java:99)
         at org.hrum.dbdb.DbdbEventQueue.removeDebug(DbdbEventQueue.java:168)
         at org.hrum.dbdb.DbdbEventQueue.remove(DbdbEventQueue.java:47)
         at org.eclipse.jdt.internal.debug.core.EventDispatcher.run(EventDispatcher.java:226)
         at java.lang.Thread.run(Thread.java:619)
 Returned: instance of oracle.jdbc.driver.T4CConnection(id=435)(class com.sun.tools.jdi.ObjectReferenceImpl)

Now, I will run this in debug mode and set a breakpoint at the first evt.returnValue() call above (the one whose result is printed). When the breakpoint is hit, evaluating evt.returnValue() in the debugger returns an instance of com.sun.tools.jdi.ObjectReferenceImpl. However, when the execution is resumed, the result is as above (that is, evt.returnValue() results in a NoSuchMethodError).

Further, if we remove the retvalMethod.setAccessible(true); line, we will get an IllegalAccessException on the invocation:

Class org.hrum.dbdb.DriverManagerMethodExitEventListener can not access a member of class com.sun.tools.jdi.EventSetImpl$MethodExitEventImpl with modifiers “public”

What is going on?

I’d say it’s left as an exercise for the reader, but honestly, at the moment, I don’t feel like looking for an answer at all. I will perhaps let Bob and Dr. Heinz Max Kabutz (did I mention how much I enjoy referring to Dr. Heinz Max Kabutz?) do the detective work…


ENVIRONMENT: This code is part of a plug-in project I am running in Eclipse 3.2RC3, with Mustang.

BOOK REVIEW: “Eclipse: Building Commercial-Quality Plug-ins”

I suppose this is more of a praise of the Eclipse plug-in architecture and available documentation than a review of the book per se, but I did not get from Eclipse: Building Commercial-Quality Plug-ins anything I could not get by scanning online docs and playing with Eclipse myself. I was up and running with my plug-in project in a very short time without opening this book, and once I did open it, I did not find anything I had not already learned or known where to turn to for more info…

It may be easy to say that many such books are just a rehash of the wealth of freely available online information, but sometimes books do have added value, say, by presenting the material in a way that makes for faster learning and/or reference. In this case, there can be no such advantage – again, because the Eclipse project’s own design and documentation are very clear and thorough…

I realized all that before getting the book; in buying it, I was looking for another advantage – hidden tips and tricks, kind of like Covert Java. For example, how do I debug a plug-in project that depends on a non-plugin one?

No such luck.

I’ll be returning this book to the store now, and maybe trying to see if Contributing to Eclipse: Principles, Patterns, and Plugins is closer to what I want…


Who debugs the debuggers, part III


…See also Part II

I suppose the Javadt approach ran out of steam. For some reason, it now takes a horribly long time to invoke a request on an ObjectReference that represents a java.sql.Connection. (A horribly long time is time enough to have a smoke, come back, see it’s still not done, and go surf the web long enough to leave the zone.) So I decide to bite the bullet and look into creating an Eclipse plugin…

…which turns out to be not too hard. And, while I am at it, I will use Java 6, and undo the horrific crap I did to get around the lack of the MethodExitEvent.returnValue() feature…

However, here’s a little but symptomatic discovery (duh!). Javadt does not like a null EventSet; it just does not check for nulls (which is ok, I suppose, for a throwaway reference implementation). So I was returning an empty EventSet to it all the while. But Eclipse will indiscriminately call resume() on an empty set, which is not what I want. So I am back to returning null. Fine. But how many such little things would it take to render this “framework” not really a framework… Or should this all be configurable?

JDBC notes

  1. Executing multi-line statements

    Apparently, Oracle’s JDBC driver doesn’t like CR/LF line endings. LF by itself is ok. So this was needed:


    sql = sql.replaceAll("\r", "");

    See also:

    http://forum.java.sun.com/thread.jspa?threadID=669282&messageID=3914430

    http://groups.google.com/group/comp.lang.java.databases/browse_frm/thread/ea6e14e596db1546/83f97ffd119eedb2

  2. “Due to a restriction in the OCI layer, the JDBC drivers do not support the passing of Boolean parameters to PL/SQL stored procedures…”
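A common workaround for the BOOLEAN restriction (my addition, not from the original note): wrap the call in an anonymous PL/SQL block, bind a NUMBER, and convert it to a BOOLEAN inside the block. Here, my_proc is a hypothetical procedure taking a single BOOLEAN parameter:

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.SQLException;

    class BooleanParamWorkaround {
        static void callWithBoolean(Connection conn, boolean flag)
                throws SQLException {
            // JDBC cannot bind a PL/SQL BOOLEAN directly, but it can bind
            // the NUMBER that the anonymous block converts to one.
            CallableStatement cs = conn.prepareCall(
                "declare b boolean; begin b := (? = 1); my_proc(b); end;");
            cs.setInt(1, flag ? 1 : 0);
            cs.execute();
            cs.close();
        }
    }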

I love it when…

…I spend time working on something under a [reasonable] assumption that I can do X, spend some more time realizing that I actually cannot, lots more cranking out convoluted code to work around that limitation, and then find out that this X has in fact been implemented in a later release than the one I have…

Here, the feature X is being able, upon exit from a method, to get the value it returned. This feature is there in JDK 1.6. I don’t need it anymore for now though… Maybe I will…
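For reference, a minimal sketch of how the JDK 1.6 feature is used (assuming an already-attached VirtualMachine; the class name here is mine):

    import com.sun.jdi.Value;
    import com.sun.jdi.VirtualMachine;
    import com.sun.jdi.event.MethodExitEvent;
    import com.sun.jdi.request.MethodExitRequest;

    class ReturnValueSketch {
        // Ask the target VM to report method exits.
        static void requestExitEvents(VirtualMachine vm) {
            MethodExitRequest req =
                vm.eventRequestManager().createMethodExitRequest();
            req.enable();
        }

        // On each MethodExitEvent pulled from the event queue, the returned
        // value is now directly available (new in Java 6).
        static Value returnedValue(MethodExitEvent evt) {
            return evt.returnValue();
        }
    }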

Oracle and JPDA

I believed that I had to go through the pain to bridge DBMS_DEBUG to JDWP. I’ve already started to look into it, using GNU Classpath’s implementation. But it turns out that Oracle already supports debugging stored procedures with JPDA.

But all that does is save work on Dbdb; it does not make Dbdb irrelevant. While adapting another debugger to JPDA is useful (and I may yet do it for something else), it is not the primary value of this project. That value is in the unified call stack.

And David Alpern of Oracle claims they already have something like this, but it’s nowhere to be found. JDeveloper allows debugging stored procedures, but the single call stack, which is what I think is the ultimate value of Dbdb, is not there…

Who debugs the debuggers, part II


…See also Part I

The JDB approach is kind of painful. Perhaps another already existing debugger can be used to try my approach? So far, I am intimidated by Eclipse (or NetBeans), and want something easier. The reason is that at this point my idea of integrating with a debugger is modifying it to supply my own implementation of a Connector; thus, I need a project whose code is easier to beat into an Eclipse project and modify.

As for more flexible integration with any debugger (for when this project is “mature”) that would not require modifying the source of said debuggers, I am considering the following. Since every good debugger has a feature to “attach” to an executing JVM, I will implement a Java-based “tunnel” for JDWP. It looks like GNU Classpath has done all of the annoying work of implementing the spec.

After examining several alternatives, I have decided, for now, to first use a modified Trace, and, when that runs out of steam, Javadt.


I notice I have previously reinvented the wheel: I read about JPDA but did not notice the existence of Trace! In an effort to track down a culprit in an execution of an application, I had replaced the JVM invocation with FoljersCristals — my homegrown version of Trace. Do I feel silly now…

Who debugs the debuggers

Digression

The subject really must be in Latin, n’est-ce pas? While I have no formal instruction in Latin, I should come up with one — what with Latin’s pretty formal structure, my general understanding of syntax and “feel” for languages, my finishing a Natural Language Processing (6.863J) incomplete 10 years later, Vocabula computatralia, Mike McLarnon’s conjugation applet, and Verbix…

Should it be “Quis emendabit ipsos emendatra”?

I should probably ask someone to translate it, which reminds me of a recursive acknowledgment Littlewood describes in A Mathematician’s Miscellany. He talks about a translated paper that had three end-notes:

  1. I wish to thank NN for translating this article
  2. I wish to thank NN for translating the above note
  3. I wish to thank NN for translating the above note

And that, of course, is where it ends, for, though the author did not know the target language, he was perfectly capable of writing note #3: by copying the second note…

So, to start with, I decided to go with JDB. The first question is how best to use it in development.

Then I came home and figured out that I have to:

  • put tools.jar into the JRE’s home (as distinct from JAVA_HOME: what is assumed is apparently the JRE’s home — to wit, if you have the JDK installed, it’s, e.g., D:\jdk1.5.0\jre rather than D:\jdk1.5.0). In other words, dropping tools.jar into the jre/lib/ext folder in addition to its rightful place in <JDK_INSTALL_DIR>\lib did the trick...

  • You should override name() of the Connector you’re implementing, for diagnostics (so that you’re not confused by the output of jdb -listconnectors), but that’s a minor thing... (a minimal sketch follows this list)
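The sketch mentioned above, mine rather than Dbdb’s actual code; only the diagnostic bits are real, the rest is stubbed (a usable connector would also implement AttachingConnector or LaunchingConnector):

    import java.util.Collections;
    import java.util.Map;
    import com.sun.jdi.connect.Connector;
    import com.sun.jdi.connect.Transport;

    // A distinctive name() shows up in `jdb -listconnectors` output.
    public class DbdbConnectorSketch implements Connector {
        public String name() { return "org.hrum.dbdb.DbdbConnectorSketch"; }
        public String description() { return "Delegating connector sketch"; }
        public Transport transport() {
            return new Transport() {
                public String name() { return "dbdb"; }
            };
        }
        public Map<String, Connector.Argument> defaultArguments() {
            return Collections.<String, Connector.Argument>emptyMap();
        }
    }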

ProxyPlus?

I am backdating this entry, so I don’t remember quite what this was about…

E-mailed Shane:

Hi,

I am working on a project that necessitates precisely what you describe — Proxies of classes, not just interfaces. So thank you, that I, apparently, don’t have to delve into BCEL (I will let you know about the project when I have a bit more to show than I do now).

However, I’m a bit confused. In a description at http://www.gnufoo.org/java/java.html you point to your
junit.extensions.jfunc.util.ProxyPlus class, but it does not exist in the jcfe.jar I got from http://www.gnufoo.org/jcfe/, nor does it exist in the jfunc project. Where should I look for this functionality?

Thanks.

Got a bounce. So, for now, I will make do with the explicit mechanism of org.hrum.dbdb.JDBCDriverFactory: go through the loaded drivers, deregister them, and register in their stead org.hrum.dbdb.DbdbDrivers that delegate to the originally registered drivers.

P.S. This is what P6Spy does
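A sketch of that mechanism as I understand it; the real org.hrum.dbdb classes surely differ, and DelegatingDriver here is illustrative:

    import java.sql.*;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.logging.Logger;

    class DelegatingDriver implements Driver {
        private final Driver delegate;
        DelegatingDriver(Driver delegate) { this.delegate = delegate; }

        public Connection connect(String url, Properties info) throws SQLException {
            // Hook point: wrap or observe the connection here.
            return delegate.connect(url, info);
        }
        public boolean acceptsURL(String url) throws SQLException {
            return delegate.acceptsURL(url);
        }
        public DriverPropertyInfo[] getPropertyInfo(String url, Properties info)
                throws SQLException {
            return delegate.getPropertyInfo(url, info);
        }
        public int getMajorVersion() { return delegate.getMajorVersion(); }
        public int getMinorVersion() { return delegate.getMinorVersion(); }
        public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }
        public Logger getParentLogger() throws SQLFeatureNotSupportedException {
            throw new SQLFeatureNotSupportedException();
        }

        // Deregister every driver and register a delegating wrapper instead.
        static void wrapAll() throws SQLException {
            for (Driver d : Collections.list(DriverManager.getDrivers())) {
                DriverManager.deregisterDriver(d);
                DriverManager.registerDriver(new DelegatingDriver(d));
            }
        }
    }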

Monkey business


After reading this Joel on Software forum thread (http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=75607&ixReplies=51), I thought I’d put in my couple of bucks… This is a RANT!

It seems that the orientation in business is towards “monkey”
programmers — those who do not think, but do as they are
told. This is because management, apparently (and justifiably),
believes that at any given time it is easier to hire a hundred
monkeys (those are trained ones, that do not type randomly,
and so less than a million and less than infinite time will suffice,
but this is not a good analogy anyway), than a Shakespeare – or even Dumas
(with his own monkeys, so that’s another bad analogy, woe is me!)

As a result, there are (the list is by no means exhaustive; Java is
the language unless otherwise specified — I think Java has produced
more monkeys who think they are software engineers than anything
else — at least VB does not lend one an air of superiority):

  • …monkeys who would rather sharpen the carpal-syndrome-inducing skills of cutting and pasting the same thing over and over again than learn something like sed or Perl or a similar tool — or, indeed, spend some effort finding out about the existence of such tools and their availability on the monkey platform of choice (read: Windows) — or even finding out what plugins are available for their lovely IDE.

    IBM, for example, provides a framework called EAD4J, Enterprise Application Development for Java (it is only available with purchase of IBM IGS consulting services). It includes components similar to Struts, log4j, etc. The framework is well designed, but here is the catch: because of its design, adding or changing a service requires changes to about 8 files. There are abstract processes, process factories, interfaces, factories, XML files with queries, files containing constants to look up these queries, etc., etc. It would really be nice if there were a simple way to manage it, plugging in your logic while some IDE plugin or script does the, well, monkey job. Otherwise it’s overdesigned.

    Now, there are simple plugins for the current IDE of choice, WSAD, that at least allow generating these standard files (if not managing them, which is also important: change one signature, and you have to change several files). These plugins are provided by IGS. But nooo, the monkeys here prefer to create all of this by hand. It’s a painful sight.

  • …macaques who cannot fathom how one
    could write a client-server application that does not communicate through
    XML requests embedded in HTTP, but – o, horror! – actually has its own
    application layer protocol.

  • …baboons who think that patterns are not merely possible (albeit very good) approaches to problems (and indeed generalizations of good approaches to common problems that have arisen), but the only way to solve problems, and that they must be copied from the book, or else it wouldn’t work. They wouldn’t know a pattern they haven’t read about if it bit them on that place their head is forever hidden in. If GoF didn’t write about it, it ain’t a pattern.

  • Ok, I am tired of enumerating primate species. I’ll
    just give an anecdote.

    I wrote a module used by several teams. Because of the ever-changing requirements, some methods and classes became useless. I gave a fair warning by email, then I gave a second one by marking them deprecated in the code. I noticed that the deprecated tags were periodically removed. I sent mail about this, and marked them deprecated again. And again. And again.

    A monkey who was the team leader of another team came complaining that
    I should remove it, because he cannot perform a build. Everyone else
    can,
    but he can’t, and so I should remove the single tag (that is probably
    more useful to the whole project than anything he’s ever
    produced). He cannot be bothered to find out how to make
    it work? Why can everyone else make it work? Oh, he’s using some Ant
    scripts? What? That’s an excuse? What the hell does that
    have to do with anything? Oh, he didn’t write those
    scripts. Well, write your own, or take them from those
    people for whom they work. Oh, you don’t have time? Well,
    I don’t have time to keep giving you warnings you just
    ignore, you twit.

    Screw you, I finally thought, the warning has been there for some
    time. I’ll just remove this stuff altogether.
    His build promptly crashed. “Not my problem – we talked about this
    over 5 weeks ago!”, I gloated, producing the emails from my
    appropriately named CYA folder.

    As Butch said, “that’s what he gets for fucking up my sport.”

In short, they are not
Joel’s kind of programmers,
to put it mildly. Monkeys see and monkeys do. They do not think. They
have been taught a way to do things, and it is beyond them to figure
out that there could be another way. I honestly do not think they
understand what a boolean is (I submit that in their mind there is an
if statement, and then there’s a boolean type)
when they write:


if (thingie.isOk()) {
    return true;
} else {
    return false;
}

Then someone they blindly trust (it must be an established authority, like a book or magazine — and only one approved by an already established authority, because monkeys do not further their education on their own — or a manager, or an instructor at a paid course) tells them about the ternary operator. Now they write:


return thingie.isOk() ? true : false;

The above two examples are from actual production code.
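Both, of course, reduce to the obvious one-liner:

return thingie.isOk();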

Further, because monkeys do not think, they often reinvent the
wheel, badly. Which is also ironic, because they have been imbued with
all the right (and wrong) buzzwords, including “reuse”. I hesitate
to hazard a guess as to whether there is some meaning in their heads
they associate with this word, or is it just something they cry out
when playing free associations with their shrinks (“OO –
Encapsulation! Polymorphism! Reuse!”).

Here are some more anecdotes.

  • One programmer on a project wrote his own utilities to convert things from/to hex numbers, for crying out loud. Here is Java, the only thing he knows at all, and he can’t be bothered to think that maybe, just maybe, such a thing is part of the standard API.

  • This same monkey took several weeks to write a parser (for a very simple grammar, containing only certain expressions and operators such as ANDs and ORs). When I asked him why he didn’t use a parser generator (such as ANTLR, CUP or JavaCC), he replied that he didn’t know any of them. Now, it is not a crime not to know a particular technology, but surely a programmer must be a) aware that there are such things as parser generators, and b) able to learn how to use one. Whether he lacked the understanding or the desire to learn, is this the kind of developer you want?

  • Background: We needed to create some scripts doing export from the database. The export was to be done under some specific conditions, which were to be specified in the queries (that is, only export dependent tables if their parent tables are eligible to be exported, etc.). The logic was only in the SQL queries; the rest were just scripts passing these queries to the DB2 command line, logging everything. All of those were written by hand, with 80% of the time spent copying and pasting things, and then looking for places where the pasted things needed to be changed a bit. (For example, some things are exported several times into different IXF files, because they are dependent on different things. These files need to be numbered sequentially, so the next one does not overwrite the previous. What do monkeys do? Number them by hand. Great.)

    When I suggested automating things — in fact, automating from the first step, even before writing our own queries, by using the metadata to generate the queries themselves — I was looked at as if I had just escaped from the mental asylum.

    Monkey But you cannot just rely on metadata, there are also
    functional links which are not foreign keys.

    Me Why are they not foreign keys in the first place?

    Monkey Because they are functional.

    Me Stop using that word. Tell me why are they not foreign keys?

    Monkey Because they are nullable.

    Me A foreign key can be nullable! Why is it not a foreign key?
    OK, whatever, that’s our DBA’s problem… But there’s a naming convention for functional keys anyway (we know they all start with SFK_). I’ll use that.

    Two days pass. My script works. A week later, they have problems with their original scripts. My approach works, demonstrably. But ok, they want to keep doing it their way, fine. They ask for help with their way – those scripts, wrapping hand-made SQL queries (which are already being automatically generated, but I’ll hold off on that for now…)

    Monkey What are you doing?

    Me Writing a Perl script.

    Monkey But there is no Perl on Windows.

    Me See, I am sitting at a Windows machine and I have Perl.

    Monkey What is it for? I thought Perl was only for the Web?

    Me I am writing a script to generate your silly scripts
    from the small set of user input. The resulting files, which you
    are now doing BY HAND, are cluttered with repetitive stuff, such
    as error-handling code and file numbering, and it’s error-prone to do
    search and
    replace manually. So we’ll generate all these scripts using my script.

    Monkey But they don’t have Perl on their Windows.

    Me Who are “they”?

    Monkey The client?

    Me First of all, this is for the AIX machine. Second, this is
    not for them, we will just deliver the generated shell scripts, the
    Perl script is for us only.

    It takes several iterations for Monkey to get it.

    A day passes…

    Me Hey, where’s my Perl script I wrote to generate the import
    scripts?

    Monkey We have to have only shell scripts.

    Me Yes, I used that one to create those shell scripts, dammit!

    Monkey (sits writing these shell scripts again by hand; at the moment, manually converting some upper-case strings to lower-case) I removed it from CVS. They only want shell scripts on their machine.

    Me It wasn’t going on their machine! It’s only for us!!!

    Monkey Here, I changed these files already, you change the
    rest.

    Me (giving up) OK.

    Monkey Oh, and they have to be K-shell. Change them all to
    .ksh.

    Me Why do they have to be ksh? What’s wrong with sh? They are
    all very simple anyway, just call db2 import, check error status,
    that’s it.

    Monkey They have to be K shell. That’s what the DBA said.

    Me What the hell does the DBA have to do with it?

    Monkey He wants to be able to change them, and he doesn’t know sh, only ksh.

    Me Ok, fine. I suppose you’re right, echo is
    different in K-shell.

    Monkey misses the seething sarcasm. Of course.

    Monkey Right here.

    Me I don’t see them. What is this OAD_0035.ksh? Is that it?

    Monkey Yes.

    Me What does this mean? What do these numbers mean?

    Monkey That’s what they said they should be called.

    Me Who are “they”???

    Silence.

    Me OK, you have a script called OAD_0035.ksh calling
    OAD_0038.ksh, which in turn calls OAD_0038_1.ksh, OAD_0039_2.ksh, etc.
    Why are they called this? It’s hard to remember which one is which.

    Monkey Why do you want to know what it means?

    Me Because if I don’t know what it means, it’s much harder for
    me to look at the file and see what is supposed to be inside. Ah, I
    see
    you added the insightful comment inside each file with its meaningful
    name. Ah, I see also, you use that stupid name inside of it over and
    over again, to write to the logs, instead of just using $0. (deep
    breath). I’ll just create some symbolic links to them with meaningful
    names, so I know what’s going on…

    An hour later

    Me Where are my links?

    Monkey They only wanted files there that are named like
    OAD_0035, etc.
    Me What the hell do these numbers mean???

    Monkey I don’t know. For security.

    Me (pause) Who told you to do this?

    Monkey The client.

    Me The client is a company. Who have you met from the company?

    Monkey I don’t know. They said the client wants this.

    Me Who said? Where? When?

    Now I’m really curious. I turn with this question to
    others. Finally I come to the last monkey who knows.

    Monkey 5 Uh, the client has guidelines on what they are
    supposed to be called. It should start with OAD, and
    then underscore, then four characters.

    Me Why four characters?

    Monkey 5 For normalization.

    Me What normalization?! What can you possibly change in a luminous egg, I mean, normalize in a 20-line shell script?

    Monkey 5 So they can keep them consistent and do some things to
    all of them, regardless of what they are called.

    Me What can they possibly want to do with shell scripts? Rename
    them to some other numeric pattern? There isn’t even any method as to
    how they are named, it’s not like a certain number
    pattern means it’s dependent on the other. You just
    named them randomly…

    Curtain

    But hey, fire one, and the replacement is easy to find. That’s true. I suppose Henry Ford would be proud, but isn’t this a backward approach? You don’t need monkeys at all; most of this work can be automated.

    Maybe I need another line of work. 🙂