How hard can line endings be, really? A horror story from integration hell | Hendrik Erz

Abstract: They are invisible to the eye, yet crucial for writing: line endings. While most of the time it doesn’t matter whether you write on Windows or on a Mac, it matters a great deal as soon as software has to detect which linefeed a file uses. In this article, I share a short story on how you can really mess up this – seemingly simple – task.


About one and a half years ago, I wrote an article about how I had failed for years to properly integrate a module into Zettlr. Writing that down felt very relieving, because doing so was indeed difficult. It was less a lack of my own abilities, and more the intrinsic difficulty of working with Electron and native code. Today I have another story from the antechamber of where that came from. And it is going to focus on something inconspicuous: linefeeds.

What is a Linefeed?

Linefeeds are really simple to understand. There are fundamentally only two types: “soft” wraps, and “hard” wraps. A soft wrap is called this way because it is inserted on demand, or “when necessary”. Take a long line that would overflow the boundaries of where the text is supposed to go. Software usually figures that out and then decides “Oh no, I should put a line break here.” Take this article: If you resize your browser window, you will see how the browser tries to shift line breaks around to ensure that the text is always properly justified. Indeed, from the computer’s perspective, every single paragraph in this article is just one long line of text without any “real” – or “hard” – line breaks.

“Hard” breaks are called hard because you, the user, insert them intentionally. For example, to start a new paragraph of text, you press the return key to force the cursor to the next line, even if the current line of text does not extend to the end of the box. These line breaks will always be respected, even if there is only a single character of text on a line. If there is a line break, the software will not simply decide to cram everything into a single line just because it would fit. This is why we can do ASCII art.

But while the concept is simple, the practical implementation is hard — as always.

Linefeeds in Software

The first issue starts with how linefeeds (or breaks, or endings) are implemented in software. Linefeeds are pretty much as old as typewriters. Yes, you heard that right: We’ve had them since the days of typewriters. In fact, back in “ye olde days” there was no soft wrapping. There was no software that could shift text around, because everything was written directly onto a piece of paper. So if you forgot to make a line break and suddenly realized the word you were about to type would overflow the margin, you had to literally insert a new piece of paper and rewrite everything.

Then came along electronic typewriters (the precursors to modern printers) that could be used with a computer. Basically you would write something on a computer (or some device would receive a telegraphic message), and the text would get sent to the typewriter to be typed automatically. Typewriters are still very manual, even if they’re electronic. This means that the software that would send text to the typewriter needed some way to tell the typewriter when to start a new line. There was only one issue: Starting a new line was not as simple as just moving the paper one row up. You also had a thing called a carriage. And that thing had to return to the beginning of the line before the typewriter could print more letters.

In other words, on a typewriter, you had to perform two tasks to go to the next line: First move the paper up a line, then move the carriage back to the beginning of the text. This historical situation is what gave us not one, but two symbols to represent new lines: \n, called linefeed (LF) and \r, called carriage return (CR).
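
In terms of characters, these are simply ASCII codes 10 and 13, which you can verify in any JavaScript console:

'\n'.charCodeAt(0) // 10 – line feed (LF)
'\r'.charCodeAt(0) // 13 – carriage return (CR)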

As electronic typewriters were phased out, some software decided to stick to the conventions that had already evolved by that time. And this is why, to this day, there is no uniform symbol to represent line endings. Specifically, there are four types:

  1. \r\n: The MS-DOS line ending that Windows uses to this day. Abbreviated “CRLF”.
  2. \n: The Unix line ending, abbreviated just “LF”. It is used by macOS and anything that runs Linux and, since there are probably more servers than desktop computers around, it is the most widely used line ending on the planet.
  3. \r: For reasons I cannot really understand, some early computers (most notably the Commodore 64 and the Apple II) decided to use a plain carriage return to indicate newlines. It should be noted that \r sometimes also appears as part of programs or protocols.
  4. \n\r: Why stick to CRLF if you can also use LFCR? This is what gives us this – surprisingly least used – line ending that – according to Wikipedia – is mainly used by the Acorn BBC and RISC OS (systems from the 1980s).

There are even more variations on this theme out there, but these are so exotic that I won’t go into details here. If you’re interested, here’s the Wikipedia article.
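
To make the difference tangible, here is the same two-line text written with each of the four conventions, as JavaScript string literals (the variable names are just for illustration):

// The same two lines of text under each of the four conventions
const crlf = 'first line\r\nsecond line' // Windows / MS-DOS (CRLF)
const lf   = 'first line\nsecond line'   // Linux, macOS, most servers (LF)
const cr   = 'first line\rsecond line'   // Commodore 64, Apple II (CR)
const lfcr = 'first line\n\rsecond line' // Acorn BBC, RISC OS (LFCR)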

Detecting Linefeeds

Now to the spicy bit of the article: Why is that worthy of an entire article? Well, because it’s from the antechamber of integration hell. Let me explain.

By the time I started developing Zettlr, I was already using Apple computers on a daily basis, and had a few servers running Debian lying around. So my entire world consisted of LF. Indeed, by that time, I already had the cultural expectation that \r would be a superfluous symbol.1 Nevertheless, when I released the app, I knew that it would also be used by Windows users, so I knew linefeeds would be important. As such, I added a small piece of code that is capable of detecting the linefeed so that files edited on Windows would keep their weird CRLF line endings, and all others would stick to the sleeker LF:

file.linefeed = "\n"
if (fileContent.includes("\r\n")) {
    file.linefeed = "\r\n"
}
if (fileContent.includes("\n\r")) {
    file.linefeed = "\n\r"
}

So what is wrong there? Well, for more than five years nothing. Indeed, while I was storing the line endings of the files, I didn’t actually need them. The code editor I used — back then CodeMirror version 5 — automatically detected line endings itself and ensured that if you pushed in a file with CRLF, you would get out a file with CRLF, and if you pushed in a file with LF … you get the point.

In the end, I could quite literally just read the file in, pass it to the editor, let a user edit it, and on save, simply dump the entire editor contents into the file. Easy!

And because it was so easy, I literally forgot about the specifics of this implementation. Why did I decide to add \n\r as a check? Why did I leave out a plain \r? I sincerely do not know.

The Switch to CodeMirror 6, or, how splitting lines can actually be quite painful

A few months ago, in preparation for Zettlr 3.0, I migrated the code editor to the most recent version of its main editor component CodeMirror: version 6. This was an insane amount of work because basically everything had changed, but it was worth it — more stability, more efficiency, and in general a pleasure to work with. It also came with a dedicated representation for documents that worked on a line-by-line basis. You may see where this is going.

So in order to pass documents to the new editor, I had to split them into lines first. And when a user wanted to save a document, I had to join the lines back together using the proper linefeed. Easy, right? I mean, after all, I had the linefeed detection built in from the start, so I could just use that! How great:

// Upon read
const lines = fileContent.split(file.linefeed)
// Upon save
const newContent = editorLines.join(file.linefeed)
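
For a file with clean, consistent line endings this round-trips perfectly, which is why the approach looked safe. A quick illustration with made-up content:

const content = 'first line\r\nsecond line\r\nthird line'
const lines = content.split('\r\n')  // ['first line', 'second line', 'third line']
const rejoined = lines.join('\r\n')  // identical to the original content
console.log(rejoined === content)    // true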

So I proceeded to do exactly that and ship Zettlr.

Early Warning Signs

Relatively soon after that, I received the first reports from users who were unable to save files. They would press Ctrl+S, but the asterisk that indicates a file has unsaved changes didn’t disappear.

A few weeks after that, people realized that there was always a specific error logged: “Could not apply updates to a document with wrong length.” This was cryptic, but basically it was CodeMirror telling me that whatever I had done to the document before loading it into the editor had changed the document’s length, even though there were no visible changes. But at that time I had no clue what was causing it.

So after a few weeks of frustratingly poking in the dark, I decided to go for the worst but safest solution and added a simple error message whenever that bug occurred so that users knew something was amiss and could save their work. (I was already exhausted by the fact that the TableEditor of Zettlr was prone to losing data as well and wasn’t having it.)

A Fateful Document

Whatever I did, nothing worked: Users would still open issues reporting that sometimes their files wouldn’t save properly. But then a new thread emerged in which users started to home in on the actual problem: Someone took a problematic file and found out that, after loading and immediately saving it, the file had changed, even though they hadn’t edited anything. The user then uploaded that file to the issue for me to check out.

After a few weeks of work-related stress, I was finally able to take a second look at it this week. It turns out that the common denominator of all problematic files was: Windows line endings.

Previously, I had dismissed this, because – remember – I knew that I had properly implemented linefeed detection.

Had I, though?

So I downloaded the document, loaded it, and indeed: I got the dreaded error: “Unable to save changes; copy the file contents elsewhere and restart Zettlr”! But at that point I still didn’t think about the linefeed detection.

I tried a few things and logged various data structures as the contents were passed from the file through the app and back into the file. At some point, I decided to simply log the extracted lines of the file after splitting them on the detected linefeed. And that’s where it clicked. Look what the extracted lines looked like:

const lines = [
    "This is the first line\r",
    "\nA third line"
]

What is important to note here is that in the original file there was an empty line between these two detected lines, and the split swallowed it. The stray \r at the end of the first entry and the \n at the beginning of the second gave it away. The file actually looked like this:

This is the first line\r\n
\r\n
A third line

Zettlr’s linefeed detection had detected the linefeed to be \n\r because that character sequence appears exactly where two CRLF line endings sit next to each other! Then I took a look at my linefeed detection again and saw what was going on.
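
The misdetection is easy to reproduce in isolation with the file’s content from above:

// Splitting the CRLF file on the wrongly detected "\n\r" linefeed
'This is the first line\r\n\r\nA third line'.split('\n\r')
// -> [ 'This is the first line\r', '\nA third line' ]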

Detecting Linefeeds is Hard

When a file passes through the linefeed detection, it defaults to \n. There is nothing wrong with that, since \n is indeed the most frequent linefeed. Then, the app would check if the file contains \r\n – the Windows CRLF linefeed. In the case of our document, this was true, since CRLF is indeed the correct linefeed for the file in question.

But then, note how there is just a second if, and not an else if: For that particular file (and any other with blank lines), if (content.includes("\n\r")) will also be true, since \r\n\r\n contains that very sequence right in the middle. So any CRLF-terminated file that has at least one blank line in it would automatically default to a \n\r line ending — mind you, a flavor that has been out of favor since the late 1980s.
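
To see the bug in isolation, here is a minimal reproduction (the sample string is made up, but the behavior is exactly what happened to the problematic files):

// A perfectly normal CRLF file that happens to contain one blank line
const content = 'first line\r\n\r\nsecond line'

console.log(content.includes('\r\n')) // true, as expected
console.log(content.includes('\n\r')) // also true: the \n of the first CRLF and
                                      // the \r of the second one form "\n\r"

// Since the second check was a plain "if" and not an "else if", it overwrote
// the previously (correctly) detected CRLF with LFCR.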

That is one of the more stupid mistakes I’ve made in my years developing the app, and it was also very consequential this time, because it actually led to data loss.

So, how could I fix that? It turns out that introducing a simple else if wasn’t going to cut it, because then you would have the same issue but reversed: A file that uses \n\r as line endings but also contains blank lines would then default to the incorrect CRLF even though the correct one was LFCR.

Since linefeeds aren’t that complicated and because I know that some Zettlr users like to tinker with old computers which probably involves loading really old files, I decided to properly support all four variations of line endings. But it turns out that it’s quite difficult. Here’s what I came up with:

export function extractLinefeed (text: string): Linefeed {
  const CR = text.includes('\r')
  const LF = text.includes('\n')
  const CRLF = text.includes('\r\n')
  const LFCR = text.includes('\n\r')

  const indexCRLF = CRLF ? text.indexOf('\r\n') : Infinity
  const indexLFCR = LFCR ? text.indexOf('\n\r') : Infinity

  if (LF && !CR) {
    return '\n' // Unix-style (Linux/macOS)
  } else if (CR && !LF) {
    return '\r' // Commodore 64 and old Apple II systems, also emails afaik
  } else if (CRLF && indexCRLF < indexLFCR) {
    return '\r\n' // Windows and MS-DOS
  } else if (LFCR && indexLFCR < indexCRLF) {
    return '\n\r' // According to Wikipedia, only Acorn BBC and RISC OS
  } else {
    return '\n' // By default, assume a simple newline
  }
}

It turns out, three lines are simply not going to cut it if you want to properly retain line endings for a file. So let’s see what we have here.

  • The first four lines contain boolean values indicating whether each of the four possible combinations occurs in the file.
  • The next two contain the index of the first occurrence of the two two-character linefeeds, or default to Infinity (relevant for later).
  • Then, the first two (and simple) checks are for Unix and Apple II/Commodore 64 systems: Here, only one symbol is used, and if the other one does not occur in the file at all, that’s a 100% certainty that the line ending is either CR or LF.
  • Next, we check for CRLF. For this to be true, CRLF must be contained in the file and the first index of the correct line ending (CRLF) must be lower than the first index of LFCR. This way, blank lines like \r\n\r\n will have the (wrong) LFCR index one position higher than the (correct) CRLF linefeed. This is also the reason for the Infinity contraption above: indexOf returns -1 if the search string is not found in the file, so by setting the value to Infinity instead, I can keep the less-than comparisons in the checks simple.
  • Lastly, if the line feed cannot be reliably detected (e.g., if there are no newlines), default to \n.
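
To double-check that this covers the cases from this article, here is a small sanity check; the Linefeed type alias is just a stand-in for the one defined elsewhere in Zettlr’s codebase:

// Minimal stand-in for the Linefeed type used as the return type above
type Linefeed = '\n' | '\r\n' | '\n\r' | '\r'

// The problematic file from above: CRLF plus a blank line
console.log(extractLinefeed('This is the first line\r\n\r\nA third line')) // '\r\n'

// A plain Unix file and an old Apple II/Commodore 64 file
console.log(extractLinefeed('one\ntwo\nthree')) // '\n'
console.log(extractLinefeed('one\rtwo\rthree')) // '\r'

// An LFCR file with a blank line is no longer misdetected either
console.log(extractLinefeed('one\n\r\n\rtwo'))  // '\n\r'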

This still produces ill-defined behavior for files in which the line endings are completely messed up, however. It will detect one of the four variations depending on circumstance and on where the symbols occur in the text. But on the next save, it will use only that one line ending to save the file.

In addition, if there are mixed line endings, splitting by the detected linefeed will result in the mess we saw in the test file. Therefore, I instead read in files with a regular expression that matches any linefeed. In the rare cases where someone has messed up their file, simply saving an unchanged file will indeed change it, but I think at this point we’re in very low probability territory. The most important thing was to ensure that files with consistent line endings are read in properly. And, as an upshot: After saving a messed up file, it won’t be messed up any longer!
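
Such a catch-all split can look roughly like this (a sketch – the exact expression in Zettlr may differ; what matters is that the two-character endings come first, otherwise a CRLF would be split at "\r" and "\n" separately):

// Match any of the four line endings; the two-character variants come first so
// that "\r\n" is consumed as one linefeed, not as "\r" followed by "\n"
const anyLinefeed = /\r\n|\n\r|\n|\r/
const lines = fileContent.split(anyLinefeed)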

Final Remarks

Over the past seven years of developing this piece of software, I have seen one pattern over and over: What is difficult to implement is not some fancy algorithm (there are libraries for that) or special behavior (most of the time it’s actually simple), but simply dealing with data errors.

Detecting line feeds is one of those issues where my hobby and my full-time work meet: No matter if in a text editor or in data science, the end boss is usually data inconsistencies, and not some algorithm.

But perseverance typically pays off: Now Zettlr is probably the only editor out there that can read in and properly handle files that have been created on an Apple II or a Commodore 64. So should you have one lying around, you can finally, after more than 30 years, edit your old files graphically again!


1 It is not quite that superfluous. Indeed, it can be very useful when working with the command line, as the \r character will be interpreted by quite a lot of terminals to mean “move the cursor back to the beginning of the same line and start overwriting it”. So it can be used to draw, for example, progress bars on the terminal.

Suggested Citation

Erz, Hendrik (2024). “How hard can line endings be, really? A horror story from integration hell”. hendrik-erz.de, 7 Jun 2024, https://www.hendrik-erz.de/post/how-hard-can-line-endings-be-really-a-horror-story-from-integration-hell.
