CodeBEAM V America 2021 21 Jun 2021
Back in March I gave a talk at CodeBEAM V America about some of the work that I did at Geometer in 2020. I started work at Geometer the same week that the SF Bay Area began its lockdown for Covid-19. While the purpose of Geometer was to be a startup incubator, on my first day there was an all-hands where Rob, the founder, said that for the foreseeable future we would devote our efforts to pandemic relief.
While working on these projects, I saw parts of the US health system that I knew existed, but I had no idea just how discouraging they would be. I also met (virtually) health care workers and department of health officials who were throwing themselves into work to help save lives, sometimes in spite of the technology that should have been solving their problems, but instead caused other, greater problems.
TL;DR
- Before deploying a Broadway pipeline to production, really try to understand GenStage.
- AWS promotes Lambda as a general-purpose data processing tool for high-scale workloads. I found Lambda to be incredibly difficult to monitor or debug, with quirks in the runtime that could only be tested through trial and error.
- Broadway/Oban/Flow could easily handle much greater scale than we were solving for, in a resilient runtime that was much easier to inspect and debug.
- Be thoughtful about naming.
  - I started by grouping data pipelines into high-level domains specific to ETL. Pipelines pulling data into our system were grouped into `Extract`. Pipelines putting data into an external system were grouped into `Load`. While technically true, this was the opposite of what new teammates expected when seeing the word “load.”
  - A clearer vocabulary might have been that used by Membrane Framework, i.e. `source`, `filter`, and `sink`.
What I saw of the politics of money within the health care system of the United States makes me so angry that I consider expatriating to a nation that prioritizes the health and well-being of its citizens over partisanship and nepotism. My feeling through the summer of 2020 was that everything everywhere needed to be burned down, then rebuilt from the ground up with human-centric values. By Fall 2020, I had decided that while I could be proud of the work that I had done, I was not the right person to solve the systemic problems within US health care.
From a programming perspective, I learned that Broadway is an extremely powerful tool for processing streaming data. Broadway is a framework built on top of GenStage.
Many things that I had assumed about composing pipelines using Broadway were wrong. When starting out, I also incorrectly thought that I could use Broadway without learning GenStage. By trial and error, and by iterating through problems serially from the data source to the data destination, we learned how to build discrete data pipelines, each performing specific data analysis and filtering at one step of the overall flow, using Postgres as a cache and a historical record of events that had transpired in our system.
Things that might not be clear from the slides, but which I talked about in the presentation:
- Source files provided to us included all records from the beginning of time.
  - At the time we launched our product, the file provided every 30 minutes included almost 500k records, or roughly 24M rows per day across 48 files.
- Node vs Ruby vs Elixir
  - Postgres database migrations in NodeJS: WAT. Why is there not one tool that solves this problem in a resilient fashion?
- Oban vs Broadway vs Flow
  - Oban is a great tool, and we used it to wrap the last leg of our data processing, i.e. calling into the API of the contact tracing system. This provided built-in retry capabilities, so that we would not have to write our own in Broadway.
  - Oban uses Postgres for its queue, making it unsuited for distributing the workload of file processing, i.e. 24M rows per day, of which only a few hundred or a few thousand were new.
  - Our Broadway file processor turned out to be overkill, and its code was difficult to understand, so something like Flow might have been a better long-term solution. Maybe Oban running a single worker, using Flow for concurrent row filtering.
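To make the shape of that workload concrete, here is a minimal Python sketch of the row-filtering idea. It is not the Elixir code from the project. The row format, the `new_rows` helper, and the in-memory `seen` set (standing in for a persistent cache such as Postgres) are all assumptions made for illustration.

```python
import hashlib

# One file every 30 minutes, each with roughly 500k records.
FILES_PER_DAY = 24 * 60 // 30
ROWS_PER_FILE = 500_000
ROWS_PER_DAY = FILES_PER_DAY * ROWS_PER_FILE  # 48 * 500_000 = 24_000_000


def row_fingerprint(row: str) -> str:
    """Return a stable digest of a raw row, used as the cache key."""
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def new_rows(snapshot, seen):
    """Yield only rows whose fingerprint has not been seen before.

    `seen` is a plain set here (an assumption); a real system would
    need a persistent store so the cache survives restarts.
    """
    for row in snapshot:
        digest = row_fingerprint(row)
        if digest not in seen:
            seen.add(digest)
            yield row


# Hypothetical rows: each file repeats the full history plus anything new.
seen = set()
first_file = ["1,alice,positive", "2,bob,negative"]
second_file = first_file + ["3,carol,positive"]

first_batch = list(new_rows(first_file, seen))    # everything is new
second_batch = list(new_rows(second_file, seen))  # only the appended row
```

Because every file repeats the full history, almost all of the work is discarding rows that were already processed. That is why a concurrent filter over a single file (what Flow offers in Elixir) is a reasonable fit, while queuing each row as its own job is not.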
Given an infinite amount of time and money to redo projects, I might skip Broadway in the future and go straight to GenStage. This would have allowed us to build one pipeline from end to end, using Postgres or another tool as an event log, but without using the event log as an event source. This is easy to say in hindsight, however. Back in June 2020, I did not have the context to even evaluate this as a solution.
That said, I found Broadway to be a great entry into GenStage, and would definitely consider using Broadway again.