What seems like ages ago, I wrote some Python scripts to keep an eye on my home lab. What I did not realize was that this little introduction to Python would help me dive into the wonderful world of ETL (or ELT, more on that later).
Manage By Numbers
I love numbers. Pick your favorite personality profile, and I come out as the cold, calculated, patient person who needs all the information to make a decision. As I ramp up and improve my management skills after a brief hiatus as an individual contributor, I have identified a few blind spots that I want to address with my team. But first, I need the data, and that data currently lives in our Jira Cloud instance and an add-on called Tempo Timesheets.
Now, our data teams have started to build out an internal data warehouse where our departments can collect and analyze data from our sales systems. They established an ELT flow for this warehouse with the following toolsets:
- Singer.io – handles data extraction and loading
- dbt – defines data transformations
- Prefect – orchestrates flows across Singer taps & targets and dbt Core transformations
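At its core, a Singer sync is just a tap process piped into a target process; the orchestration layer's job is to schedule and monitor those runs. As a rough, stripped-down stand-in for what that looks like (the tap and target commands in the comment are placeholders, not our actual deployment):

```python
import shlex
import subprocess

def run_singer_sync(tap_cmd: str, target_cmd: str) -> str:
    """Pipe a Singer tap's stdout into a target's stdin, i.e. `tap | target`."""
    tap = subprocess.Popen(shlex.split(tap_cmd), stdout=subprocess.PIPE)
    target = subprocess.run(
        shlex.split(target_cmd),
        stdin=tap.stdout,
        capture_output=True,
        text=True,
    )
    tap.stdout.close()
    tap.wait()
    return target.stdout

# In a real flow, this would be something like:
# run_singer_sync("tap-jira --config jira.json", "target-snowflake --config sf.json")
```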
Don’t you mean ETL?
There are two primary data integration methods:
- ETL – Extract, Transform, and Load
- ELT – Extract, Load, Transform
At their core, they have the same basic task: get data from one place to another. The difference lies in where data transformations are processed.
In ETL, data is extracted from the source system, transformed, and then loaded into the destination system (data warehouse) where it can be analyzed. In this case, the raw data is not stored within the data warehouse: only the transformed data is available.
ELT, on the other hand, loads raw data directly into the data warehouse, where transformations can be exercised within the data warehouse itself. Since the raw data is stored in the warehouse, multiple transformations can be run without accessing the source system. This allows for data activities such as cleansing, enrichment, and transformation to occur on the same data set with less strain on the source system.
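To make the distinction concrete, here is a toy ELT run using SQLite as a stand-in for the warehouse (the table and column names are invented for illustration). The raw rows are loaded untouched, and the transform is just a query run inside the warehouse, so it can be re-run or changed without ever touching the source system:

```python
import sqlite3

# "Load": raw data lands in the warehouse exactly as extracted.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_worklogs (author TEXT, seconds INTEGER)")
warehouse.executemany(
    "INSERT INTO raw_worklogs VALUES (?, ?)",
    [("alice", 3600), ("alice", 1800), ("bob", 7200)],
)

# "Transform": runs inside the warehouse; the raw table is still there
# for other departments' transformations (or a point-in-time audit query).
hours_by_author = warehouse.execute(
    """
    SELECT author, SUM(seconds) / 3600.0 AS hours
    FROM raw_worklogs
    GROUP BY author
    ORDER BY author
    """
).fetchall()
```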
In our case, an ELT transition made the most sense: we will have different transformations for different departments, including the need to perform point-in-time transformations for auditing purposes.
Getting Jira Data
Jira data was, well, the easier part. Singer.io maintains a Jira tap to pull data out of Jira Cloud using Jira’s APIs. “Taps” are connectors to external systems, and Singer has a lot of them. “Targets” are ways to load the data from tap streams into other systems. Singer does not have as many official targets; however, a number of open source contributors maintain additional ones. We are using the Snowflake target to load data into a Snowflake instance.
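Under the hood, taps and targets talk over stdout/stdin using newline-delimited JSON messages defined by the Singer spec (SCHEMA, RECORD, STATE, etc.). A minimal sketch of the two most common message shapes, using a made-up `issues` stream:

```python
import json

def singer_messages(stream, schema, records, key_properties):
    """Yield a Singer SCHEMA message followed by RECORD messages as JSON lines."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    })
    for record in records:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": record})

# One schema line, then one line per record -- this is what flows
# through the pipe between a tap and a target.
lines = list(singer_messages(
    "issues",
    {"properties": {"key": {"type": "string"}}},
    [{"key": "PROJ-1"}],
    ["key"],
))
```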
Our team built out the data flow, but we were missing Tempo data. Tempo presents some REST APIs, so I figured I could use the tap-jira code as a model to build out a custom tap for Tempo data. And that got me back into Python.
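The heart of such a tap is walking the API's paginated responses and emitting each result as a record. The response shape below (a `results` list plus a `metadata.next` link) reflects my reading of Tempo-style REST pagination and may not match the API exactly; the fetcher is injected so the paging logic stays testable without a network:

```python
def iter_paginated(fetch_page, first_url):
    """Follow a chain of pages shaped like
    {"results": [...], "metadata": {"next": <url, or absent on the last page>}}
    and yield every result in order."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("results", [])
        url = page.get("metadata", {}).get("next")

# In the real tap, fetch_page would perform an authenticated GET against
# the Tempo REST API and JSON-decode the response body.
```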
I’m running WSL2 on Windows 11 with Ubuntu 22.04. I finished up my tap development, and things seemed to be working fine using virtualenv for isolation; I had finished testing my Tempo tap. I wanted to test pushing the data into Snowflake, and upon trying to load the target-snowflake library, I got a number of errors about version incompatibility.
Well, hell. As it turns out, most of the engineers at work are using Windows 10/WSL2 with Ubuntu 20.04, which means they are running Python 3.8. I was running 3.10. A few Google searches later, I realized I needed a better way to isolate my environments than virtualenv. Along comes a bigger snake…
My Google searches led me to Anaconda. First, I’m extremely impressed that they got that domain name. Second, Anaconda is way more than what I’m using it for: I’m using the environment management, but there is so much more.
I installed Anaconda according to the Linuxhint.com guide and, within about 10 minutes, I had a virtual environment with Python 3.8 that I could use to build and test my taps and targets. The environment management is, in my mind, much easier than using virtualenv: the conda command can list environments and search for available packages, rather than making you remember where you stored your virtual environment files and how to activate them.
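For reference, the whole setup boiled down to a handful of conda commands (the environment name is my own choice; the package searched for is just an example):

```shell
# Create an isolated environment pinned to the team's Python version
conda create --name singer-py38 python=3.8 --yes

# Activate it -- no need to remember where the env files live
conda activate singer-py38

# List every environment conda manages
conda env list

# Search available versions of a package before installing
conda search snowflake-connector-python
```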
Not changing careers yet
Sure, I wrote a tap for Tempo data. Am I going to change over to a data architect? Probably not. But at least I can say I succeeded in simple tap development.
A Note on Open Source
I’m a HUGE fan of open source and look to contribute where I can. However, while this tap “works,” it’s definitely not ready for the world. I am only extracting a few of the many objects available in the Tempo APIs. I have no tests to ensure correctness. And, most importantly, I have no documentation to guide users.
Until those things happen, this tap will be locked up behind closed doors. When I find the time to complete it, I will make it public.