This article has been prompted by a discussion yesterday on the Science of Science Slack channel. Caifan Du asked what the best way would be to store about 16GB of research data efficiently since it was becoming extremely slow to work with on her own computer. Almost immediately the idea was brought up to simply drop everything into an SQLite database. And to that I immediately had to respond “No!”
The data will now (probably) be transferred to a sharded MySQL database cluster and that will speed up almost anything by large margins.
On the Slack I was lacking the full explanation due to space because – as everything in life – it’s complicated™. So why shouldn’t you dump a ton of data into a convenient single file on your computer? And why is actually distributing the data across many computers much faster? In this article I attempt to explain the science behind that. It’s still fresh in memory because I had to learn it the hard way that you do not, and I repeat, do not drop Gigabytes of data into an SQLite database.
Disclaimer: This article is not arguing that you shouldn’t use SQLite at all. SQLite is an extremely convenient database format that will come in handy in lots of different situations. Be it that you want to simply mock an SQL database to test out some web server code or because you are a reasonable person and know that one simply doesn’t use Excel. The only thing you shouldn’t do with SQLite is try to dump the equivalent of one day of data on the entire internet into it.
What is SQLite?
First things first: What actually is SQLite? In simple terms it’s a library that enables you to write and read from
.sqlite-files. It’s basically a regular database – with the difference that it works with a single, self-contained file. Normally, databases (be that MySQL, PostgreSQL, MariaDB, or what not else) use a whole folder in which they store thousands of different files. They split up the data tables efficiently, maintain some indices and do a lot of other engineering magic under the hood. SQLite, on the other hand, is as simple as it can get: A single file, everything in it, and that’s it.
The benefits are obvious: You can simply copy the file somewhere else and that’s it. No exporting your data using SQL files that increase the actual size of the data due to the many
INSERT-statements in the backup file. Also, this makes SQLite good for phone applications. Since these apps have only limited space, and are additionally sandboxes, this is the best (and only) way to maintain a full database on the phone. This enables you to conveniently use the SQL language to develop mobile apps.
But the benefits don’t stop there. The SQLite team has found out that their database protocol can actually be up to 35 % faster than the file system if your files are less than 10kb large. Also, the theoretical maximum file size of a SQLite database is about 281 Terabytes. Crazy, isn’t it!? The latter argument was actually brought up as one of the reasons for choosing SQLite to store a large amount of data in the discussion.
So let’s talk about why these benefits unfortunately don’t hold up. Both arguments run up the same problem: size.
Theory ≠ Practice
The argument that SQLite files can theoretically be 281 Terabytes large is a spurious one. And it is being purported by the SQLite team despite them knowing better. The reason is that, yes, mathematically, an SQLite file can grow up to 281 Terabytes before the file will become corrupted. But that is a mathematical limit based on the page sizes of the protocol. It only refers to the fact of how many different positions the SQLite protocol can address, not to the actual amount of space a database can take.
Think about it: Do you know how large the biggest hard disks or SSDs are nowadays? They’re not much larger than maybe 10 Terabyte. I didn’t google that because the actual maximal size doesn’t matter since no single disk will have 281 Terabytes of space. And that is what you need: Since SQLite files are single, continuous files on your file system, they have to be stored on one, physical hard drive. You could theoretically split them up, but then they wouldn’t work anymore until you pieced them together again.
So the practical limit of the size of an SQLite file is much lower than the theoretical limit. The biggest limit is physics: There is only so much space on a hard disk, and if that’s full, then you’re not gonna increase the size of your database, no matter how hard you try. A second limitation is the file system. File systems have been built with reasonable file sizes in mind. That means: The limit for the files are even smaller than the limits for hard drive sizes. And then there’s speed: If you need an overview over the data you have you must at some point query every single piece of data in the SQLite file, and the bigger that file is, the longer this will take. In the end, the hardest limit on the size of an SQLite file will be your sanity: Anything above maybe 100 Megabyte, or a Gigabyte if you’re lucky will be so creepingly slow that you’re much better off with a regular SQL database.
The reason why I’m elaborating on this issue so much is because I see this argument everywhere on the net. And how are you supposed to know? Until you run into that problem yourself, you’d have to have a computer science degree to know that computers pose much bigger constraints on what you can do with them than any arbitrary protocol that could – in theory – handle up to 281 Terabyte of data. So always ask when someone tells you about some incredible size limit: Is this a mathematical limit or a practical one?
But speaking of “Until you run into that problem yourself” — yes, of course I did run head first into that wall myself. So let me tell you the story of yet another instance where my stubbornness brought me amazing knowledge – at the expense of valuable time I could’ve spent doing other stuff.
Why SQLite is Unsuitable as a File Storage
A few months ago, when I was collecting my dataset, I encountered a problem. As you might know I work with the Congressional Records dataset, that is: Everything that has been said or done on the floor of the U.S. Congress since 1873. The data comes actually in two parts. First, there is the so-called “Bound Edition”, which is everything up to 1993. Until that year, the Congressional Records were exclusively printed on paper and released every year as one bound edition, one part per day of session.
From 1994 onward, however, the Congressional Record was kept digitally. The benefit of this is – apart from a lack of OCR errors – that you can divide the data into very granular pieces down to the level of individual congressional speeches. And when I downloaded the data, I quickly ran into a problem: My own computer.
The thing is: In total, the data from 1994 until May 2021 is stored in 600,000 individual text files and that is a problem. First, there are some operating systems (not gonna name anything here, you know which operating system I mean) whose file system is severely limited to the point that, for example, filenames cannot be longer than an extremely small number of characters (which is still a problem in the most recent version 10 of said operating system which I will not name here) or that the maximum amount of files in a folder is capped at some point.
But even without any unreasonable theoretical limit of files in a folder, you will run into a problem. You know, I currently possess one of the fastest consumer laptops on the market, and I once accidentally clicked the directory where my 600,000 files were stored and … it hang. I literally managed to bring my file browser to a grinding halt by making it scan a directory with 600,000 files and render it. The problem is not getting a list of the files – that takes a few milliseconds at most – but that any file browser will display additional info for each file, such as an icon. And rendering things on a computer is one helluva mess. I know that because I tried several times. Native code hates graphics. But I diverge.
So with my file browser showing me the loading spinner of doom, I saw my fears confirmed: These 600,000 are just too many files and I need to cut them down. So what I did was I checked the file sizes and – lucky me – the average size of my files was well below the magic threshold of 10kb. So the knowledge of that article on the SQLite page dug itself into my working memory and I quickly hacked together some code that took all of the files and stored them in a single SQLite file.
Now I had a single file, which loaded as fast as it can get and which had only 8GB. Perfect, although I was a little bit sad that SQLite didn’t also compress the text contents, but I guess that was wishful thinking on my part. A database is not a zip file and, as I quickly realized, that’s good.
So I began happily working with that file and indeed: My calculations ran much faster across all files! I didn’t exactly measure the time, but it was significantly faster; I just don’t know if it was more like 20 or 35 % improvement.
And then I thought to myself “well, now that I can also access the metadata of the files pretty quickly, let’s just run this analysis on a subset of all files!”, wrote some code and hit “run”.
And I waited. And waited. And waited. Suddenly, my code was actually many times slower than it had been before! What has happened?
I double checked my code several times, had a look to see where my (obvious) logical error was, but to no avail. The same code that previously selected files to work on in a few seconds now took thousands of hours to do the same (I did not let that script run for the 3,000 something hours of estimated runtime).
How could that be?
Well, follow me into the dark abyss of software engineering and indices!
Indices to the rescue
The reason why the same database that could be queried insanely fast in one way came to a grinding halt when I tried more sophisticated queries is the same as the reason for why a sharded cluster of multiple databases can be faster for large amounts of data than a monolithic database on a single computer. The magic ingredient is: indices.
You know, I found the logical error once I ran a simple test: First, I attempted to query each file by its index, which is just the number of rows which keeps increasing as you put more data into the table. That one ran quickly. And then I attempted to query each file, but this time by its filename. I didn’t do any stuff with the filenames, I just said
SELECT * WHERE filename = "something" instead of
SELECT * WHERE id == 12345. And that took ages (if I had let it run, which I didn’t do, I’m a sensible person).
And this gave me the clue as to what was happening: Every database system maintains what is called an index. You can really think of them as the index of terms at the end of scientific books with page numbers so you can quickly navigate to the right place and look something up. Indices can be created for anything where you will need to look something up in a large amount of data. You can, for instance, create a word-index, that is: A separate file that contains one entry for each entry in your original data, but only contains the unique words in the original data. The index will be much smaller than your real data, but it will make finding stuff much faster. The computer will search that index for the word you’re looking for and give you back the database index where the entry containing that word is stored.
Something similar is what SQLite did: When I created the database in the first place, it created a pretty basic index: It stored a tuple of the database IDs and the exact point on the disk where that database ID was stored. So when I tried to access a file by its ID in my SQLite database, the code looked into the index file, saw “Ah, that ID begins at offset 12345!” and, instead of opening the entire 8GB file, it simply jumped to the beginning of the file (for example at address 98765) plus the offset 12345. And there the entry was. It could return the data I was looking for in a fraction of a second.
But as soon as I tried to search for something that was not indexed – the filename column –, SQLite had to actually open the database file and check every single entry from the very beginning until it found the entry I was looking for. You can imagine that sifting through 8GB of data takes time – even for modern computers.
So I quickly abandoned the idea of using a handy SQLite database to store all my data. In the end, I settled on combining the granular pieces of data into daily files, which resulted in less than 10,000 files instead of the original 600,000 and that was easy to display even for my file browser.
So the next time you have data at hand, don’t just think about “how can I conveniently store that?” but also about how you can ensure that your computer will not abandon you while attempting to find the literal needle in the haystack.
If you’re interested in learning about what indices are and why they are so incredibly useful, I recommend this beautiful explainer by Tom Scott, titled: “How Binary Search Makes Computers Much, Much Faster”.
Until next time, when I make a big mistake for us all to learn from!
Update Fri, Sep 17, 2021: I just scrolled through my notes today, and I found this gem of a paragraph I wrote about a year ago:
Research shows that reading and writing datasets to SQL databases of any kind is much faster than writing it to files directly. Pandas supports working with Excel and CSV sheets easily, but those have not been made with huge datasets in mind. However, Pandas support for SQL and SQLite is still missing. It’s possible to do that, but having a dedicated module would help shift data analysts away from Excel more quickly, because it would reduce the threshold of using SQLite or SQL by a large margin.
Oh, how wrong and naïve I was back then.