Optical Character Recognition (OCR) is hard. In fact, it’s so hard that today there is basically just one open-source engine that does it reliably: Tesseract. The problems of OCR are manifold and range from simply bad scans, through unusual typefaces, to the fact that sometimes it’s impossible to distinguish an “l” from an “I” without any context.
We all know the problems of OCR from reading PDF files: When we encounter an older paper or book that never existed digitally and has only been scanned, it either has no OCR layer (that is, you cannot select text, since such a scan is effectively just an image) or the OCR layer (the selectable text hidden below the image) contains errors. And the worse the quality of the scan, the worse the errors get.
Introducing OCR Errors
Take the following example, taken from the congressional records of July 3d, 1873:
A.DJOURN~IENT SINE DIE.
As humans, we have no problem identifying that this should read “Adjournment sine die”, even if we have never encountered that expression before. If you just read a text and maybe highlight a few passages, these OCR errors won’t matter much. Even if you extract that text by copying it somewhere else, you’ll have an easy time fixing those errors manually, and you’re good to go.
But now imagine you were sitting on top of 30 gigabytes of textual data, all of it OCR’d text with these errors. And now imagine that you must correct those errors because you want to run some analyses on the data, and to a computer “A.DJOURN~IENT” and “ADJOURNMENT” are simply different strings. Then you have to perform what is generally subsumed under OCR post-processing, OCR error-fixing, or simply data cleaning.
What do you do?
Obviously you can’t read 20,000 days of recorded text and manually fix those errors. I mean, theoretically you could, but then you might as well shoot yourself in the foot. Doing everything manually is a waste of resources, and we can most certainly draw on decades of computational research instead.
So the first thing you want to do is go look online, because we’re in 2021 and we have computers that can generate code and even whole novels, so there must be some program that can do that for you! But you will quickly realize: It’s not so simple.
To begin with, there is no single program or library where you could simply call some method “cleanup_text()” that fixes those errors. There is a litany of different libraries and programs available, many of them quite sophisticated, but they generally tend to be cumbersome to use. So you take a step back and ask yourself: “What do I actually need to do?”
OCR Errors Are Unique to the Data
The first step before any correction can happen is to actually take a look at the OCR’d text and see if you can already spot patterns in the errors. This helps you understand the different kinds of errors in your data. Every text dataset suffers from different errors. They are the result of a combination of the typeface used to typeset the text, the quality of the scans, the quality of the OCR program, and that program’s settings.
The most common error, present in almost any OCR’d text, is that characters are misrecognized, and ADJOURNMENT becomes A.DJOURN~IENT. This error is actually quite easy to fix if you have some knowledge about your text. In the case of the congressional records, with which I work, we know that they contain valid English words, so it suffices to split the text on white space and literally run each word through a spellchecker. If a word is reported as wrong, generate suggestions and replace it.
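In code, that basic loop could look roughly like this (a minimal TypeScript sketch; the dictionary Set and the suggestCorrection helper are placeholders for whatever word lists and spellchecker you actually use, not any particular library’s API):

const dictionary: Set<string> = new Set(["adjournment", "sine", "die" /* ... */]);

// Hypothetical helper: propose a replacement for a word the dictionary rejects.
// (See the Levenshtein-based ranking sketched further below.)
function suggestCorrection(word: string): string {
  return word; // stub: a real implementation would return the best suggestion
}

function correctText(text: string): string {
  return text
    .split(/\s+/)
    .map((token) => {
      // Strip punctuation and stray symbols before looking the word up.
      const normalized = token.toLowerCase().replace(/[^a-z]/g, "");
      if (normalized.length === 0 || dictionary.has(normalized)) {
        return token; // known word (or pure punctuation): keep it
      }
      return suggestCorrection(token); // unknown word: try to fix it
    })
    .join(" ");
}

correctText("A.DJOURN~IENT SINE DIE.");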
However, even here it will help to think about what your soon-to-be-written algorithm will encounter in your data. The congressional records frequently contain speeches by representatives. These are always introduced with the pattern <Prefix> <LASTNAME><period>. The prefix is normally either Mr. or Mrs. (the congressional records have not yet entered the golden age of LGBTQI*-existence) or The. The latter prefix tends to be followed by special instances of <LASTNAME>, namely PRESIDENT or VICE PRESIDENT.
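To make that pattern concrete, a rough regular expression for it might look like this (a sketch only; the exact spelling and spacing in the real records may well differ):

// Matches "Mr. SMITH.", "Mrs. SMITH.", "The PRESIDENT." etc. at the start of a line.
const speakerIntro = /^(Mr\.|Mrs\.|The)\s+([A-Z][A-Z .'-]*?)\./;

const match = "The VICE PRESIDENT. The Senator from Maine is recognized.".match(speakerIntro);
if (match !== null) {
  console.log(match[1], match[2]); // -> "The" "VICE PRESIDENT"
}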
This leads to an incredibly important insight: It will not suffice to simply load a spellchecker. You additionally need a list of the family names of every representative who served on that day, since any of them could be giving a speech, and most spellcheckers don’t contain every family name in existence, which risks “MITCHELL” getting “corrected” to “MISCELLANEOUS”. Furthermore, even though a spellchecker offers functionality to suggest correct words, it is effectively just a list of words, so you might want to download multiple lists of English words and even generate your own (if you, like me, happen to have thirty years of congressional records that luckily did not suffer the fate of OCR). Afterwards, it is easy to recover “ADJOURNMENT” as the correct alternative to “A.DJOURN~IENT” by computing the Levenshtein distance between the erroneous word and each candidate: “ADJOURNMENT” will have the shortest distance (as opposed to, for example, “ADMONITION”).
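A minimal sketch of that ranking step could look like this; the two word lists are of course just stand-ins for the downloaded dictionaries and the list of representatives’ names:

// Classic dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

// Pick the candidate with the smallest edit distance to the erroneous word.
function bestCandidate(word: string, candidates: string[]): string {
  let best = word;
  let bestDistance = Infinity;
  for (const candidate of candidates) {
    const distance = levenshtein(word.toUpperCase(), candidate.toUpperCase());
    if (distance < bestDistance) {
      bestDistance = distance;
      best = candidate;
    }
  }
  return best;
}

const englishWords = ["ADJOURNMENT", "ADMONITION" /* ... */];
const representativeNames = ["MITCHELL" /* ... */];
bestCandidate("A.DJOURN~IENT", [...englishWords, ...representativeNames]);
// -> "ADJOURNMENT"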
But let us move on to the more interesting errors. The congressional records’ OCR contains a very peculiar error that I discovered while reading through the files: Malformed hyphenation. When you have justified text, you will frequently encounter hyphens splitting longer words across line boundaries:
This is text which con-
tains some hypheniza-
tion, which is probably
horrible to read but you
know what I mean.
In the OCR layer of the congressional records, the hyphens at the end of a line are missing. I am not sure where they got lost – either in the OCR itself or when I extracted that OCR layer into a plain text file. But as a matter of fact, they’re gone. If they were present, it would take me five minutes to write an algorithm that looks at each line and tests whether it ends with a hyphen. If so, I could simply remove the hyphen and join the word with whatever follows on the next line. You might already see that even this simple algorithm makes assumptions: It assumes that hyphenation is present in the text and that the text contains words delimited by white space. Without the hyphens, the solution is to check every word at the beginning and at the end of a line, see if it gets reported as erroneous, and, if so, check whether the concatenation of that word with the last one from the previous line or the first one from the next line suddenly gets reported as correct. If it does, you can be pretty sure one of the missing hyphens used to be there.
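A sketch of that check could look like this (again in TypeScript; isKnownWord is a placeholder for the dictionary lookup from before):

// Placeholder dictionary check: in practice, ask the spellchecker / word lists.
function isKnownWord(word: string): boolean {
  const known = new Set(["contains", "hyphenization"]);
  return known.has(word.toLowerCase().replace(/[^a-z]/g, ""));
}

function repairLostHyphens(lines: string[]): string[] {
  const tokenized = lines.map((line) => line.split(/\s+/).filter((w) => w.length > 0));
  for (let i = 0; i < tokenized.length - 1; i++) {
    const current = tokenized[i];
    const next = tokenized[i + 1];
    if (current.length === 0 || next.length === 0) continue;
    const lastWord = current[current.length - 1];
    const firstWord = next[0];
    // If both fragments look wrong but their concatenation is a known word,
    // a line-end hyphen most likely went missing here.
    if (!isKnownWord(lastWord) && !isKnownWord(firstWord) && isKnownWord(lastWord + firstWord)) {
      current[current.length - 1] = lastWord + firstWord;
      next.shift();
    }
  }
  return tokenized.map((words) => words.join(" "));
}

repairLostHyphens(["This is text which con", "tains some hypheniza", "tion, which is probably"]);
// -> ["This is text which contains", "some hyphenization,", "which is probably"]

Note that with a full English dictionary, short fragments such as “con” are themselves valid words, so a real implementation would probably also want to check that the merged word is more plausible than leaving the two fragments alone.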
Another error I encountered is that some lines are segmented completely wrong. Take this example from the records of February 19th, 1991:
N A V A L R E S E R V E .
T H E F O L L O W IN G N A M E D C A P T A IN S O F T H E R E S E R V E
O F T H E U .S . N A V Y F O R P E R M A N E N T P R O M O T IO N T O T H E
G R A D E O F R E A R A D M IR A L (L O W E R H A L F ) IN T H E L IN E
A N D S T A F F C O R P S , A S IN D IC A T E D , P U R S U A N T T O T H E
P R O V IS IO N S O F T IT L E 1 0 , U N IT E D S T A T E S C O D E . S E C -
T IO N 5912:
No problem deciphering that, right? Well, not for a computer. We generally treat English text in computer systems by splitting it on white space and creating one entry per distinct word (i.e. a vocabulary). That vocabulary could look like this:
var vocabulary = ["Naval", "Reserve", "The", "following", "named", "captains", "of", "U.S.", "Navy", "for", "permanent", "promotion", ...]
However, if we just split the above text by white space without any preprocessing, we would get the following:
var vocabulary = ["N", "A", "V", "L", "R", "E", "S", ...]
Congratulations, you just extracted the English alphabet!
What is required in preprocessing here is what is called re-segmentation. Re-segmentation involves removing all white space and then trying out places to re-insert white space between the characters until the constituent words of the line emerge (remember: a computer doesn’t have an eye to “see” the obvious, it has to brute-force its way to the correct result). In general, it works like this:
T H E F O L L O W IN G N A M E D C A P T A IN S O F T H E R E S E R V E
// Remove whitespace:
THEFOLLOWINGNAMEDCAPTAINSOFTHERESERVE
// Randomly insert whitespace:
TH EFOLL OWING NAMED CAP TAINSO FTH ERESERVE
// Keep "NAMED" since it's an actual English word, shuffle around the whitespace:
THE FOLLO WING NAMED CAPTAI NSOF THE RESERVE
// Keep "THE", "NAMED", "THE", and "RESERVE", shuffle the whitespace
THE FOLLOW ING NAMED CAPTAINS OF THE RESERVE
// Fix the last error:
THE FOLLOWING NAMED CAPTAINS OF THE RESERVE
This approach has been presented in Peter Norvig’s chapter on Natural Language Corpus Data, available on his website.
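A heavily simplified sketch of such a segmenter might look like this. Instead of the corpus-derived word probabilities Norvig uses, it just prefers the split that uses the fewest known words; knownWords is, once more, a stand-in for a real word list:

const knownWords = new Set(["the", "following", "named", "captains", "of", "reserve" /* ... */]);

// Return the segmentation with the fewest known words, or null if none exists.
function segment(text: string, memo: Map<string, string[] | null> = new Map()): string[] | null {
  if (text.length === 0) return [];
  if (memo.has(text)) return memo.get(text)!;
  let best: string[] | null = null;
  // Try every possible first word (up to 20 characters) and recurse on the rest.
  for (let i = 1; i <= Math.min(text.length, 20); i++) {
    const head = text.slice(0, i);
    if (!knownWords.has(head.toLowerCase())) continue;
    const rest = segment(text.slice(i), memo);
    if (rest !== null && (best === null || rest.length + 1 < best.length)) {
      best = [head, ...rest];
    }
  }
  memo.set(text, best);
  return best;
}

const garbled = "T H E F O L L O W IN G N A M E D C A P T A IN S O F T H E R E S E R V E";
segment(garbled.replace(/\s+/g, ""));
// -> ["THE", "FOLLOWING", "NAMED", "CAPTAINS", "OF", "THE", "RESERVE"]

The fewest-words heuristic breaks down when two splits use the same number of valid words (think “THE RESERVE” versus “THERE SERVE”), which is exactly where the word probabilities from Norvig’s chapter come in.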
From Errors to Implementation
As you can see, these OCR errors each have a clear problem statement and a workable solution. However, getting a computer to apply that solution is much harder than it seems. It is interesting to note that three distinct approaches to this problem have emerged over the past decades: a Bayesian one, a Frequentist one, and one based on deep learning. Very roughly speaking (and possibly to the detriment of all Bayesians reading this – sorry!), a Bayesian approach works with so-called a priori beliefs, for example the belief that the word “the” should occur very frequently in an English text. The Frequentist approach works similarly, but instead of pre-emptively assigning probabilities with which the word “the” will occur, a Frequentist will first take some textual data that is known to be correct and simply count the number of times the word “the” occurs.
That being said, a Bayesian approach might yield better results for so-called “low-resource” languages (i.e. those for which only few textual sources are available), since it allows one to manually tune the probabilities before letting the OCR correction algorithm loose. A Frequentist approach, however, while requiring more correct data up front to draw information from, could be more precise, especially if you can take that information from the very same source the erroneous data came from. The last approach, deep learning, is potentially the most accurate, but also the most labor-intensive. Since neural networks learn by identifying patterns, you need quite a lot of data that maps, for example, the word “A.DJOURN~IENT” to “ADJOURNMENT”. So not only does deep learning require some gold-standard data, it also requires you to perform this translation work manually for a few texts first, before having the network fix the rest of the dataset for you.
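To make the Frequentist idea concrete, here is a tiny sketch of the counting step; the sample sentence is obviously just a stand-in for a real known-correct corpus:

// Count how often each word occurs in a corpus known to be correct.
function wordFrequencies(correctText: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const raw of correctText.toLowerCase().split(/\s+/)) {
    const word = raw.replace(/[^a-z]/g, "");
    if (word.length === 0) continue;
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

const counts = wordFrequencies("The Senate met. The Journal of the proceedings was read.");
counts.get("the"); // -> 3
// Dividing such counts by the total number of words yields the relative
// frequencies that take the place of hand-tuned priors in the Bayesian approach.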
I am still considering which methods will yield good results. OCR error correction is not really standardized, so every data cleaning endeavor is a new process of trial and error. Of course, I will at some point in the future write a follow-up outlining the procedure I chose for the congressional records.
See you then, and stay sharp!