About a year ago, I published an article titled “Why you shouldn’t use SQLite”. In the article, I described my failed attempt at storing half a million documents inside a single SQLite database file and concluded from my mistake that “You shouldn’t use SQLite” (in general, not “I” shouldn’t use SQLite).
All of this was prompted by a short discussion on a Slack channel in which someone suggested that one can safely store about 16GB of research data in an SQLite database. With my failure still fresh in mind, I disagreed and suggested not to use it. But, since Slack is Slack and not a blog, I promised to explain in an article why that may be a bad idea, which I then did.
I collected a few arguments pro and contra using it and ended with the recommendation not to use SQLite for research purposes. Last week I found out that four months ago someone had posted my article to Hackernews, and the ensuing discussion unearthed a few very interesting points, to which I would like to respond in this article. Many of the points raised there are valid criticisms of my article, proving almost all of my original arguments wrong. In short: It was a trainwreck.
The Problems In My Argument
The first argument that is completely bogus is about speed: I argued that, if you’re not careful, the access times of the database will actually be slower than simply using the file system.
Now, obviously there is a large flaw in this argument if you know anything about databases. Specifically, I never thought about the option to simply create additional indices for the columns I frequently queried. Had I done this, the SQLite database would indeed have been blazingly fast and, on top of that, just one file instead of the 600,000 I had previously. This was one of the first pieces of feedback I got, and yes, it is a very big oversight on my side. By default, SQLite only indexes the primary key column, and if you filter on any other column, of course it will take ages to sift through (in my case) 60GB of data.
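To make this concrete, here is a minimal sketch of what that fix would have looked like, using Python’s built-in sqlite3 module. The database file, table, and column names (research.db, documents, author) are hypothetical stand-ins for my actual data:

```python
import sqlite3

con = sqlite3.connect("research.db")  # hypothetical database file

# Without an index, a query filtering on `author` scans every row.
# One extra statement turns that full scan into an index lookup:
con.execute(
    "CREATE INDEX IF NOT EXISTS idx_documents_author ON documents (author)"
)
con.commit()

# Queries filtering on `author` can now use the index.
rows = con.execute(
    "SELECT id, title FROM documents WHERE author = ?", ("Doe",)
).fetchall()
con.close()
```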
Another flaw in my article was that I mentioned “Oh god, my file browser crashes when I try to open a directory with 600,000 files!” The reason why this argument is simply stupid is that you can also just not open such a directory in the file browser. Handle those files using code, and you’ll be good.
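For instance, a few lines of Python happily iterate over a directory that would freeze any graphical file browser; the directory path and file extension here are hypothetical:

```python
from pathlib import Path

data_dir = Path("data/documents")  # hypothetical directory

# A script has no trouble walking over 600,000 files, even when
# a file browser chokes on rendering the directory listing.
for file in data_dir.glob("*.txt"):
    text = file.read_text(encoding="utf-8")
    # ... process the document here ...
```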
The commenters brought up a third problem. In my article I said that the theoretical limit of 281 terabytes for a single SQLite file was impractical since no hard drive could store such a large file. Several people subsequently noted that, by using RAID or LVM, you can combine many physical drives into a single logical volume far larger than 281 terabytes. Such storage solutions contain a ton of hard drives (one variant is literally called JBOD, “Just a Bunch Of Disks”), and the operating system exposes them, for instance over the network, as a single, large drive. So my argument that the 281 TB file size limit would be purely theoretical was wrong as well.
At this point, it really doesn’t look good for me, does it? All the technical reasons why using SQLite would be a bad idea have been dismissed, so I have to accept my fate: Using SQLite is completely fine. Right?
Well, not quite.
Why You Still Probably Shouldn’t Use SQLite
Technically, I was wrong: using SQLite has neither speed problems (for many small files, SQLite can even be up to 35% faster than the file system) nor size problems.
However, I would say there is still a fundamental reason why using SQLite may be a bad idea. My problem was that I never articulated it in my original article. Instead, out of fear that I would be wrong on the internet™ (which, ironically, is precisely what happened), I collected “objective” reasons for why SQLite is bad; technical details which I used to build my strawman. Think about it: how is the information about SQLite’s maximum file size even relevant if my main problem was speed?
The actual reason why I think you shouldn’t use SQLite is one of time constraints: implementing an SQLite layer will probably take you more time than simply reducing the number of files. (Note that I’m still talking about research data, which normally doesn’t change after you’ve collected it.) It will also increase the complexity of your code and thus make errors more likely. After all, an SQLite database that is only used as a file system adds an unnecessary intermediary layer to your code, doesn’t it?
Accessing files is normally much simpler than querying an SQL database.1 Even though both R and Python have means of working with SQLite directly, it will still cost you a few more lines of code in which you can accidentally introduce a lot of mistakes. One commenter pointed out that pandas now has a single function call to read in SQL. However, this still requires you to (a) create a connection object (the second parameter of the function) and (b) write an error-free SQL query.
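Here is roughly what that looks like; the database file and table name are again hypothetical:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("research.db")  # hypothetical database file

# One function call, but you still supply a connection object and
# a query string whose correctness is only checked at runtime.
df = pd.read_sql("SELECT * FROM documents WHERE year >= 2015", con)
con.close()
```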
This points to another argument I could’ve made (but didn’t) involving the skillset of researchers: while most of us are very proficient with Python or R, we normally have less experience with SQL. And, given that working with SQL from within these languages is often somewhat awkward, this can slow down the research progress. Plus, the added complexity makes mistakes more likely. If you don’t believe me, read this XKCD on the havoc erroneous SQL statements can wreak.
The important trade-off to think about here is time: if you estimate that optimizing your file-based code will take longer than just dropping all of your data into a database, then please, do the latter. If you are very proficient with SQL, this is likely the case. My point is that many people (me included) are not well-versed in SQL, and, in order to make your own analysis less error prone and finish on time, it is best to default to solutions without SQL first, such as simply reducing the number of files.
Conclusion
Let me end by first thanking all those people who criticized my argument and pointed out where I was wrong. I’m not a computer scientist, and the thread on Hackernews proves this well. However, I think the thread also shows that, in order to fully realize the benefits of databases over the file system, there are a lot of things you need to think about.
I am always fascinated by the tools that are available to us, both hard- and software, but one thing I have learned the hard way over the past two years is that sometimes what is required is not an ideal but a working solution. Yes, using SQL(ite) has a certain elegance to it, but if implementing it takes five days, it is simpler and faster to just use the boring file-system approach.
But without having discussions such as the one we had across these two articles, I may still be stuck with my firm belief that SQLite is just awful. Which would be wrong. I am looking forward to more discussions like this one—and using more SQL(ite) databases in the future!
-
Both R and Python offer a single function to open and read a file. To query any database, however, you first need to set up a connection, then prepare a statement, and then execute the statement before getting back the entries. Therefore, querying SQL databases is more complex than working with simple text files. ↩
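A side-by-side sketch in Python illustrates the difference; the file, database, and table names are hypothetical:

```python
import sqlite3

# Plain file: a single call.
with open("document.txt", encoding="utf-8") as f:
    text = f.read()

# SQLite: connect, execute, fetch, close.
con = sqlite3.connect("research.db")
cur = con.execute("SELECT body FROM documents WHERE id = ?", (42,))
row = cur.fetchone()
con.close()
```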