Seven years ago, in the spring of 2016, my first student-assistant job was coming to an end. For about two years, I had worked at the German Federal Agency for Civic Education (Bundeszentrale für politische Bildung, BPB). But due to the regulations on fixed-term contracts in the public sector, I couldn’t renew my contract again, so I had to find new ways to wander. As Bonn was a thriving research town, finding a new job was luckily easy, and this time I even managed to land one at the university itself, more specifically at the Center for Development Research (Zentrum für Entwicklungsforschung, ZEF).
I worked for a young postdoc who had a research project on the Water–Energy–Food Security Nexus in Ethiopia. His demands were simple: he gave me a set of Stata data files and tasked me with bundling them into a new dataset for his research. In total, the data amounted to approximately 1 GB. He assigned me a desk in a large corner office shared with a bunch of PhD students and told me to get to work.
Over the next three months, we had frequent meetings in which he explained what new variables he needed, and I would dutifully modify my scripts to make them spit out those variables. Shortly after I began, the combination of the heat in the room (it was summer, and my computer was approximately as powerful as a large potato) and the size of the dataset slowed my work to a crawl. In the end, I remember, a full run from the raw ingest data to the final data frame took about 90 minutes of constant number crunching.
Since I did not have much thinking to do while the scripts ran, I began to think about the code itself. Naturally, the two biggest speed bumps were the computer and the limitations of Stata’s engine. But there were ways and means of reducing the run time. I decided to split up the work, which lent itself naturally to the structure of the dataset: there were two waves (it was a longitudinal dataset), with four large questionnaires each. So I created a hierarchy of files: one master file that would call the two main wave “mergers” (because in the end everything needed to be thrown together), and those two mergers would in turn call smaller helper files that actually processed the four questionnaire files.
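A minimal sketch of that hierarchy, translated into R terms (the original was a tree of Stata do-files, and every file and variable name here is my invention):

```r
# Two "wave mergers". In the real project, each of these would in turn
# source() four helper files, one per questionnaire. For a self-contained
# demo, we create them as tiny scripts in a temporary directory.
dir <- tempdir()
writeLines('wave1 <- data.frame(id = 1:3, income = c(10, 20, 30))',
           file.path(dir, "merge_wave1.R"))
writeLines('wave2 <- data.frame(id = 1:3, income = c(12, 25, 28))',
           file.path(dir, "merge_wave2.R"))

# The master file does little more than run the mergers and then
# throw everything together at the end.
source(file.path(dir, "merge_wave1.R"))
source(file.path(dir, "merge_wave2.R"))
main_df <- merge(wave1, wave2, by = "id", suffixes = c("_w1", "_w2"))
```

The point of the hierarchy is that each file stays small and single-purpose, while the master file only orchestrates.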
I just took another look at the code, and found that I had even dutifully noted down any suspicious things in the data. “In variable hh_s9q14 is a non-labeled value 11”, I wrote. Or: “In assetsLastMonth there seems to be 1 outlier (43.600 BIRR compared to a max of 6755 otherwise)”.
Looking at the source code again after all these years is enlightening, and it also shows that I have always tried to do my work diligently. But this is also the reason for today’s article: there is a connection between the work I do now and the work I did back then, in that much-too-hot corner office in the summer of 2016.
As you may or may not know, my main research right now concerns the discursive dynamics unfolding in the US Congress of the 1970s and 1980s. For that, I spent the last nine months generating exactly three variables: categories of speech into which I could classify sentences from the speeches that representatives had made decades ago. As of about two weeks ago, that work is finally done and over with. But there was no time for celebrations: the next task is to take that data and “do proper science” (don’t tell my supervisor I said it like this) with it.
To do so, I have to run regressions, but for regressions I need more than three target variables; I also need something to regress against those variables, in other words: additional data sources. So I began constructing an entire, massive new data frame, and immediately the memories of that summer job came back. The only differences now are that I am using R instead of Stata, and that I have to decide for myself which variables to generate.
Interestingly, my dataset now is approximately the same size as the one from back in the day: roughly the same number of observations, slightly fewer variables. And it takes just as long to process. Except that now my computer is approximately 20x as powerful as that potato was back then.
Yes, I know, there is likely a way to speed things up a lot (I’m bad at R). Yes, there are probably also language limitations at play. But that doesn’t matter, because my code as it is now just works. I did try to optimize it a little, but the resulting data format was off.
At that point, I had two options. I could either play a little more code golf and try to find elegant solutions to trivial problems. Or I could leave it at that and pull out an old trick of mine: divide the work into chunks that write their results into a TSV file, which only has to be regenerated if anything changes. Then I don’t have to care about how sluggish the data generation is; I can run the work once and get on with my analyses.
And this works flawlessly: the code looks a bit different (because, after all, it is not Stata code), but the principle remains the same: one file per chunk, included via a source() call in a comment that I can uncomment as soon as that particular file has to be run again. The actual loading of the data frame when I need to tinker with new regression ideas is as simple as running load("main_df.Rda"), which takes about a second.
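A self-contained sketch of the caching trick (all file and variable names are invented; in the real pipeline, the expensive part is a whole source()’d chunk file, not an inline block):

```r
# Expensive chunk: only re-run when its cached TSV is missing or stale.
cache <- file.path(tempdir(), "chunk1.tsv")
if (!file.exists(cache)) {
  result <- data.frame(year = 1970:1972, speeches = c(100, 120, 90))
  write.table(result, cache, sep = "\t", row.names = FALSE)
}

# Fast path: every later run only pays for the read, not the computation.
chunk1 <- read.delim(cache)

# The finished frame is cached the same way, but as a binary .Rda file:
rda <- file.path(tempdir(), "main_df.Rda")
main_df <- chunk1
save(main_df, file = rda)

# Day-to-day analysis sessions then need only this one line:
rm(main_df)
load(rda)
```

The TSV intermediates have the nice side effect of being inspectable in any text editor, while the final .Rda loads much faster than re-parsing text would.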
From a programmer’s perspective, this may look like the work of cavemen, but that need not bother me: the data looks exactly as it should, and the code works.
This is what I have been preaching here on the website as “pipeline programming”. Programmers create software that is centered around some internal state, and most problem-solving revolves around the issues that arise in conjunction with that state. What happens if you have an unrecoverable state? How do you prevent this? Should you use OOP or functional programming for that?
These are not the problems of the data analyst.
The data analyst needs exactly one state, which starts with an empty data frame and ends with a full one. While I was looking at my old Stata code, I remembered that a while ago someone mentioned that Stata apparently only supports a single data frame in memory. They mentioned this as a limitation, but, to be honest, I don’t think it is. I think that being limited to one data frame actually focuses you, because that’s what data science is all about in the end: you have this one big chunk of numbers in your program, you massage it until it has the right format, and then you run a bunch of statistical analyses on it.
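That one-state workflow can be sketched in a few lines, with invented toy variables standing in for my real ones: a single data frame enters, gets massaged into shape, and a regression runs on the result.

```r
# The one state: a single data frame with toy, made-up variables.
df <- data.frame(
  speeches = c(12, 30, 7, 22, 15, 40),   # toy "speech count" per legislator
  majority = c(0, 1, 0, 1, 0, 1)         # toy party-majority dummy
)

# Massage step: derive the variable the model actually needs,
# still inside the same single frame.
df$log_speeches <- log(df$speeches)

# Payoff step: run the analysis on the finished frame.
fit <- lm(log_speeches ~ majority, data = df)
coef(fit)
```

Everything between the first line and the lm() call is just reshaping that one object, which is exactly the kind of work a pipeline of cached chunks handles well.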
Data science is a dirty profession. But cleaning up this dirty data step by step, gradually building a huge tower of clean observations, and then running regressions on it has something oddly satisfying about it.