Athena Federated Queries: Azure Data Lake Storage, part II

In our previous installment, we learned that Athena does not support ADLS directly (without Synapse). I decided to try to rectify the situation. Initial draft here: https://github.com/debedb/athena-azure-adls

It totally sucks performance-wise: it’s too slow to be useful. But at least it’s got a connection…

But then again, Dremio seems to be really good at this. It appears to work well with blob storage (ADLS on Azure, GCS on GCP, S3 on AWS), in some cases even better than Athena with all the blobs in S3.

I may add benchmarks if I can.

To be continued…

Credit where it’s due

Microsoft came up with a fix for an issue (mentioned in a previous post) quite quickly.

While figuring out the reason for the magic number to backtrack from, though, I had posited another reason, and I was wrong… Overall, it reminded me of:

The appearance of our visitor was a surprise to me, since I had expected a typical country practitioner. He was a very tall, thin man, with a long nose like a beak, which jutted out between two keen, grey eyes, set closely together and sparkling brightly from behind a pair of gold-rimmed glasses. He was clad in a professional but rather slovenly fashion, for his frock-coat was dingy and his trousers frayed. Though young, his long back was already bowed, and he walked with a forward thrust of his head and a general air of peering benevolence. As he entered his eyes fell upon the stick in Holmes’s hand, and he ran towards it with an exclamation of joy. “I am so very glad,” said he. “I was not sure whether I had left it here or in the Shipping Office. I would not lose that stick for the world.”

“A presentation, I see,” said Holmes.

“Yes, sir.”

“From Charing Cross Hospital?”

“From one or two friends there on the occasion of my marriage.”

“Dear, dear, that’s bad!” said Holmes, shaking his head.

Dr. Mortimer blinked through his glasses in mild astonishment. “Why was it bad?”

“Only that you have disarranged our little deductions. Your marriage, you say?”

Which in turn reminded me of

And why the hell does “Shipping Office” get translated as “пароходство” (a steamship company)?

Athena Federated Queries: Azure Data Lake Storage

Well, this one is super broken, which one finds out after shaving a number of yaks.

We want to query Parquet files that sit in Azure Data Lake Storage with Athena. AWS has what seems to be nice documentation on how to do it… Except:

  1. Searching for it in Serverless Application Repository with “azure” or “adls” terms yields nothing.
    • Additionally there seems to be a bug there, per AWS support:

      Issue:
      – The search functionality appears to be unresponsive when using the traditional “Enter” key method
      – This seems to be a technical bug in the console
      Workaround:
      – Enter your search term in the search bar
      – Instead of pressing Enter, click anywhere on the screen
      – This should trigger the search functionality and display the results

    • Searching for something like “gen2” actually yields something… It’s an AthenaDataLakeGen2Connector, which is the same thing as below, so read on.
  2. Trying to add the Data Source from Athena, selecting “Microsoft Azure Data Lake Storage (ADLS) Gen2” connector… It is based on athena-datalakegen2 code which is borken because the underlying mssql JDBC driver is borken.
  3. After patching the mssql driver and the connector, we realize that it is trying to connect via JDBC to ADLS, but that is not supported. And yet AWS claims “the documentation is correct”.
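To make point 3 concrete, here is a minimal sketch of a sanity check one could run on a connection string before blaming the driver. It is not what the connector does; it only looks at the host in a jdbc:sqlserver:// URL and says whether it points at a raw storage endpoint (which serves files, not SQL) or at a Synapse SQL endpoint. The helper names and the account/workspace names in the usage note are hypothetical; the host suffixes are the standard Azure ones for ADLS Gen2, Blob storage and Synapse SQL.

```python
# Host suffixes: the storage ones are file endpoints and do not speak SQL.
STORAGE_SUFFIXES = (".dfs.core.windows.net", ".blob.core.windows.net")
SYNAPSE_SQL_SUFFIX = ".sql.azuresynapse.net"

JDBC_PREFIX = "jdbc:sqlserver://"


def jdbc_host(jdbc_url: str) -> str:
    """Extract the lowercased host from a jdbc:sqlserver:// URL (hypothetical helper)."""
    if not jdbc_url.startswith(JDBC_PREFIX):
        raise ValueError("not a jdbc:sqlserver URL")
    # Connection properties follow the first ';'.
    server = jdbc_url[len(JDBC_PREFIX):].split(";", 1)[0]
    # Drop an optional ':port' and an optional '\instance'.
    host = server.split(":", 1)[0].split("\\", 1)[0]
    return host.lower()


def classify_target(jdbc_url: str) -> str:
    """Return 'storage', 'synapse-sql' or 'unknown' based on the host suffix."""
    host = jdbc_host(jdbc_url)
    if host.endswith(STORAGE_SUFFIXES):
        return "storage"
    if host.endswith(SYNAPSE_SQL_SUFFIX):
        return "synapse-sql"
    return "unknown"
```

For example, classify_target("jdbc:sqlserver://myaccount.dfs.core.windows.net:1433;databaseName=x") comes back as "storage", which is the situation described above: a SQL Server JDBC driver pointed at something that only serves files.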

Srsly now, AWS and Microsoft, you even tested anything?

It’s already 2025, and still

Some Postman rants and tips and tricks

I like Postman in general. But some things are annoying, so there…

APIs and Collections and Environments

APIs are great, and equally great are their integration with GitHub and the ability to generate Collections from API definitions and have them updated when the API definition changes. Nice. Except… those Collections cannot be used to create Monitors or Mock servers; you need to create standalone Collections (or copy the ones you generated from under APIs). But now those don’t integrate with GitHub. There is a fork-and-merge mechanism that kind of takes care of the collaboration, but it is annoying that the two modes are different. Ditto Environments. What’s up with that?