Uniform Resource Locators, or URLs for short, are everywhere. We interact with them every day, and if you’re reading this, you have probably already typed one today. We use URLs to visit websites, download files, or share links with friends and colleagues. But it’s not just us who use URLs every day; the applications we use all the time use them as well. Our messengers, social media apps, and many other tiny helpers that work with online data all communicate with some server using URLs. For example, when you view a post on Bluesky, the app will not simply open the Bluesky website and display its rendition of the post. Rather, it will fetch the post’s contents and some metadata from an API, making the app faster and more accessible, as it doesn’t have to download all the HTML and CSS that usually comes with websites.
But how do we actually decide whether something is a URL or not? I think URLs are akin to pornography: “You know it when you see it.” There are few objective criteria for what is a URL and what is not. URLs are extremely malleable, and you can leave off a lot of information without compromising your computer’s ability to make sense of them. And this is precisely the topic of today’s article: We casually use browsers and URLs every day without thinking about the work our technology has to do to get us what we want. Indeed, any software application that wants to let you use URLs has to do a lot of heavy lifting and guesswork under the hood to ensure that what you type is what you get. And in doing that, our software apps – regardless of whether it’s a mobile app or a desktop program – have to use a lot of heuristics that are just shy of what we humans are capable of doing with language.
So let me take you on a journey to understand how difficult we make it for our computers to understand our intentions each day, and how well browsers handle this.
Motivation
Let us play a little game to get started. I mentioned earlier that “we know a URL when we see it.” So let’s put it to the test. Which of the following are actually valid URLs?
- `google.com`
- `https://www.bsky.app`
- `file:///Users/name/Documents/Essay.docx`
- `https://dx.doi.org/10.1016/j.poetic.2009.11.004`
Now, if you have guessed that all of these are valid URLs, you were right! Not very difficult, right? How about the following?
- `mailto:john.doe@example.com`
- `tel:+1234567890`
- `doi:10.1016/j.poetic.2009.11.004`
- `did:plc:z72i7hdynmk6r22z27h6tvur`
Did you have to think a bit? Understandable, but again: All of these are valid URLs. The first is what you’ll have to provide to make a link open a user’s email program. The second is the same, but for calling a number – this is often handy when you visit a website on your phone, as it allows you to call someone with a tap. The third one is the official DOI URL scheme, which was proposed to the IETF 21 years ago, albeit never standardized. Lastly, the `did` one is not, strictly speaking, a URL but a URI. It describes the unique identifier of a Bluesky account. However, all URLs are also URIs, and so the last one can, in principle, also be opened by an app. Here we are gently moving towards the “hardness” that I mentioned in the title.
It turns out that, while we humans sometimes aren’t certain whether something is actually a URL – because certain forms are very infrequently used – neither are our apps. Your web browser won’t open the `doi` URL, for example, even though it uniquely describes a single published paper, and we know very well where it lives online.
Let’s look again at the first URL I provided, `google.com`. What makes you so certain that it’s a URL? In fact, if you open the browser’s developer tools and attempt to construct a `new URL()` with just `google.com`, this function will throw an error and explain that it is not, in fact, a valid URL. How come?
Well, the simple explanation is that the corresponding standards, RFC 1738 and RFC 3986, define valid URLs as requiring a “scheme” — the “https” that you frequently see in front of many URLs. But it feels like a valid URL, right? You could, right now, type “google.com” into your browser’s address bar, hit enter, and you would end up at the search engine. (But please come back afterwards.)
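You can verify this yourself in any JavaScript console (the `URL` class is available in every modern browser and in Node.js):

```typescript
// The WHATWG URL constructor rejects anything that lacks a scheme.
let error: unknown = null;
try {
  new URL("google.com");
} catch (err) {
  error = err; // TypeError: Invalid URL
}
console.log(error instanceof TypeError); // true

// With a scheme in front, the very same string parses just fine.
const url = new URL("https://google.com");
console.log(url.protocol); // "https:"
console.log(url.hostname); // "google.com"
```

The constructor implements the strict reading of the standard: no scheme, no URL.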
To understand what is going on here, and why we humans and our little machines rarely agree on what is a URL and what is not, we have to take a step back. In the following, I will first explain why this is even a problem to begin with. Then, I will elucidate how our software solves this problem – and remarkably so, as I rarely hear anybody complain that “my URLs don’t work.” Finally, I will end with a somewhat broader lesson that ties back to today’s Large Language Models (LLMs) and generative AI. (This sounds like a stretch, and probably is, but bear with me, it’ll be fun!)
Where Human-Machine Interaction Needs a Translator
First, why is it an issue that some strings we identify as URLs without a doubt aren’t seen as URLs by our machine overlords, and vice versa? The issue begins with how we interact with them. We humans rarely type in full URLs. Instead, we are often pretty lazy and type only the minimum parts of a URL needed to guess what the actual URL should have been.
For example, when we hack “google.com” into our browser’s address bar, it knows at least one thing: It’s a browser, so you probably want to open a website. Since the only “scheme” your browser supports is HTTP(S), it will automatically add “https://” in front of whatever you typed, which turns it into a valid URL. Then, it will request whatever resource it can find there. In turn, Google’s web server will see that you requested just “google.com,” but that’s not where the search engine is. The search engine you are looking for is located at “www.google.com,” so the web server will tell your browser “Nope, these are not the webs you are looking for” and instruct it to open “https://www.google.com” instead. And this is what appears in your address bar after the browser has loaded Google. We don’t usually see this, as it happens in a fraction of a second, but it turns out that not one but two machines had to work in tandem to turn our lazy “google.com” thingy into something that you (probably) wanted. So the next time you make a disappointed joke about how your smart bulb can’t make coffee yet, reflect on your life choices instead.
That’s already a lot of work just to help us apes properly visit websites, but the issues don’t stop there. In fact, if you’ve ever written a document, say, a blog post, and added a link manually, how would you have done it? You probably just provided “google.com,” and not the correct “https://www.google.com.” Now, try it out: In your note-taking app or in Word, or wherever, really, create a link but only type in “google.com.” Then, try to open this link (typically by Ctrl-clicking). Some apps will do this flawlessly – they will, indeed, assume that you probably meant a web link and silently add “https://” (or “http://”, because Microsoft Word apparently still lives in 1995) before whatever you typed. But some apps may not. Do you know what happens to links when there is no translation from whatever you typed into what you (probably) intended? Let’s give it a test run. Here is a link to just “google.com.” Hover over it with your mouse to see what happens.
My website runs on Winter CMS, and the built-in Markdown-to-HTML converter does not perform any link normalization,1 so whatever I type in is what it spits out. If I type in “google.com,” it will produce a link that only consists of “google.com,” and when I type in “https://www.google.com,” that is what it will produce. However, remember that the former is not a valid link. So instead, your browser will provide the necessary translation to turn that thing into a valid URL. And how does it do that? Well, the RFCs that I mentioned above are very explicit about this: They provide a “recommended” way of resolving anything that doesn’t yet look like a proper link. And this heuristic will make your browser resolve “google.com” relative to whatever website you are on right now. So clicking on this link will produce a “File not found” error, because it does not, in fact, go to “https://www.google.com.” Go figure.
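You can watch this resolution rule in action with the `URL` constructor, which performs exactly the RFC 3986 reference resolution when given a base URL (the base here is a made-up example page):

```typescript
// A schemeless string is treated as a reference relative to the
// current document's URL, per RFC 3986 § 5.
const base = "https://example.com/blog/my-post";

// What a naive "google.com" link actually resolves to on that page:
const resolved = new URL("google.com", base).href;
console.log(resolved); // "https://example.com/blog/google.com" – a 404, not Google

// A fully qualified URL ignores the base entirely:
console.log(new URL("https://www.google.com", base).href); // "https://www.google.com/"
```

So the browser isn’t misbehaving when a bare “google.com” link sends you nowhere; it is following the specification to the letter.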
But why is this almost never an issue for us? Because most of our software is really good at making all of these educated guesses. Because that’s really all there is: Turning something into a valid URL is pure guesswork – plain and simple. It is just very accurate guesswork.
Using Pragmatics to Resolve Links
This now brings us to the final leg of our journey, where we will add some insights from linguistics. Yes, I was going to make this about text analysis and computational linguistics. You did not really think I would let you get out of here without a detour into my PhD, did you?
To make my next point, I need to enlarge the premise a bit. Let us leave behind the world of the browser and enter the world of note-taking apps. Specifically, the spark that prompted me to write this article was a recent incident I had with my own app, Zettlr. In a recent update, Electron made some changes that need not concern us in detail. What is of interest is that it now enforces the usage of URLs across the framework. Previously, when a user wanted to link a file, I could just send plain file paths back and forth. This is no longer possible. However, since paths are just URLs with a `file://` in front of them, I thought to myself: “Easy!” And, well, it wasn’t. The 3.5.0 update completely broke links and images across the entire app for Windows users, and I had to work a night shift to release 3.5.1 to fix that. That was not very pleasant. But it prompted me to think again about URLs and how they are used in Zettlr.
Since Zettlr is not a browser, it can’t open HTTP URLs. But it should be able to construct them. You should be able to create some link and Ctrl-click on it to open it in whatever is your default browser. But it doesn’t stop there. Since it’s an offline app that works on top of the files on your computer, it makes sense to let people link files as well. And how do you do that? Well, you provide a relative path to some file. And, of course, `mailto:` links should work as well – preferably without the user having to type the `mailto:` thing every time.
If you think about it, anything can be converted into a link. But, because we’re lazy, we usually rely on our software to guess what the valid URL describing this thing is. We want our software to recognize that, if we write “google.com,” we mean “https://www.google.com.” When we type “+1234567890,” we expect it to detect that we actually wanted to create a `tel:` link. And when something has an `@` in it, we expect our computer to understand that it’s an email address.2 All of this is made more difficult when we also have the option to link files on our computer. Then we additionally expect our software to detect that `image.png` is not a web link to `https://image.png`, but rather a file on the local computer, relative to the current file’s path.
And this is where things get mushy. In the example of Zettlr, it takes its own, big utility function to take whatever garbage3 a user provided and turn it into a valid URL that links to something. It is one of the most heavily commented functions across the codebase. Take a look. And there is a reason for this: A lot of the magic of link resolution is based on world knowledge.
Linguists would call this pragmatics. (I am not a linguist [IANAL], so take this paragraph with a grain of salt; I don’t want to step on anybody’s toes with definitional details.) Pragmatics is a dimension on which we interpret language. Many things don’t make sense purely from the words of a sentence; we need some context to understand them. For example, if someone tells you “I don’t like the weather” and you respond “Me neither,” it could very well be that they don’t like sun and you don’t like rain. You need some world knowledge to fully understand language (and even then, there are tons of ambiguities).
This is precisely the same problem we have when a user writes something in a place where there may be a link, but they didn’t provide a valid URL. The computer and you operate from completely different contexts. You view your computer from a completely different standpoint than your software does. When you work on a Markdown document and provide a link, you may think of a completely different context than what your software has available. One interesting discovery I made while fixing the aforementioned bug was that network shares work not just in a Windows context, but also on macOS – yet not on Linux. Specifically, if you link to, e.g., `//server/path/to/file.png`, this is a no-brainer on Windows, as the whole idea of network shares is very common there, especially in corporate environments.4 But what happens on Macs or Linux computers? Well, that is quite interesting. On Macs, you can actually resolve network shares, as those are always mounted at `/Volumes/server`. But on Linux, you cannot know that without knowing the list of existing mounts on the computer, as Linux users can mount network shares wherever their heart desires. This means that the link `//server/path/to/file.png` is essentially invalid in the context of Linux. But you may – and with good reason – expect the link to a network share that you created on your Windows computer to also work on Linux.
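This platform divide can be sketched in a few lines. The following is hypothetical illustration code, not Zettlr’s actual implementation; the macOS branch relies on the `/Volumes` mounting convention mentioned above:

```typescript
// Hypothetical sketch: turn a network-path reference ("//server/…")
// into a local filesystem path, depending on the platform.
function resolveNetworkPath(ref: string, platform: string): string | undefined {
  if (!ref.startsWith("//")) return undefined;
  const rest = ref.slice(2); // e.g. "server/path/to/file.png"
  switch (platform) {
    case "win32":
      // Windows understands UNC paths natively: \\server\path\to\file.png
      return "\\\\" + rest.replace(/\//g, "\\");
    case "darwin":
      // macOS auto-mounts shares under /Volumes, so we can guess the path.
      return "/Volumes/" + rest;
    default:
      // Linux: shares can be mounted anywhere, so without reading the
      // mount table we simply cannot know where this file lives.
      return undefined;
  }
}

console.log(resolveNetworkPath("//server/path/to/file.png", "win32"));
// prints \\server\path\to\file.png
```

The Linux branch returning `undefined` is the point: no amount of cleverness recovers context the operating system doesn’t provide.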
That this situation doesn’t happen very frequently is due to very clever engineering. Or rather, accurate engineering. Let us walk through what the utility function in Zettlr does to convert whatever it finds into a valid URL.
First, it checks whether something is already a valid, full URL. In that case, it simply returns it. This also ensures that, if some partial URL you provide is ambiguous, you can make it unambiguous by actually writing the full URL, scheme and all. It also makes the following logic simpler, as the function can expect the user to do some “reasonable” (whatever that may mean) work to ensure that the software produces valid URLs. Second, it checks whether it is an email address. If so, it adds the `mailto:` scheme, if necessary, and returns it as a valid URL. This is because, in the overwhelming majority of cases, URLs to websites no longer contain `@` symbols, so their presence is a strong indicator that something is an email address. Third, it checks whether the thing starts with `//`. This is usually a good indicator that the user intended to link to a network share (if we omit something like `https://`, we usually omit it completely and don’t leave the `//` in). Fourth, it checks whether the thing starts with a dot (`.`), since that is typically a very good indicator that the user didn’t want to link to some website, but rather to a local file. Lastly, it checks whether the end of the string looks like a valid file extension, e.g., `png`, `jpeg`, or `md`. If so, it assumes that you intended to link a local file, not some website. For image extensions, this works unambiguously, since those filename extensions aren’t (yet) registered top-level domains (TLDs) – but not necessarily for the Markdown extension, `md`, which is also the TLD of the Republic of Moldova. In the latter case, a simple remedy is to, well, provide the `https://` first, to indicate to the app that you indeed wanted to open a Moldovan website, not a file on your computer.
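The chain of checks just described can be sketched roughly as follows. This is a simplified, hypothetical version for illustration only – Zettlr’s real function is longer and, among other things, resolves relative paths against the current file’s directory:

```typescript
// Simplified, hypothetical sketch of the heuristic chain described above
// (not Zettlr's actual implementation).
function makeValidUri(input: string): string {
  // 1. Already a full, valid URL? Return it as-is.
  try {
    return new URL(input).href;
  } catch {
    // Not a full URL yet: fall through to the heuristics.
  }

  // 2. Looks like an email address? Add the mailto: scheme.
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(input)) {
    return "mailto:" + input;
  }

  // 3. Starts with "//"? Probably a network share.
  if (input.startsWith("//")) {
    return "file:" + input;
  }

  // 4. Starts with a dot? A relative path to a local file.
  if (input.startsWith(".")) {
    return "file://" + input;
  }

  // 5. Ends in a known file extension? Treat it as a local file, too.
  if (/\.(png|jpe?g|gif|svg|md|pdf)$/i.test(input)) {
    return "file://" + input;
  }

  // Fallback: assume the user meant a website.
  return "https://" + input;
}

console.log(makeValidUri("google.com"));           // "https://google.com"
console.log(makeValidUri("john.doe@example.com")); // "mailto:john.doe@example.com"
console.log(makeValidUri("./notes.md"));           // "file://./notes.md"
```

Note how the order matters: the email check must come before the fallback, and the extension check is deliberately last, because a full URL or an explicit `./` already settles the question.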
This is why this URL resolving almost always works flawlessly. However, naturally there are always exceptions. There are some contexts where this specific approach to resolving URLs will always yield wrong results. These form the limits of what one can do with some assumptions and knowledge about how the world works (or, rather, how humans think in this case). But at least in the case of URLs, it highlights how well this imputation works: we almost never think about these difficulties, because it “just works.”
Clever Engineering Trumps Generative AI in Common Use-Cases
This leads me to the final argument I want to make today. It has become increasingly common for people to suggest generative AI, or even “ChatGPT,” as a valid solution to an absolutely mind-boggling range of problems – from sentiment classification through survey-response imputation to text labelling. To almost any difficulty, it appears, “ChatGPT” is a valid answer. But I think with the issue of URLs, I have found one (of the many) instances in which ChatGPT proves what it is: a simulacrum of a person, something that pretends to be reasonably human and thus falls short of both being human and being a machine. The creators of generative AI sacrificed machine precision to emulate human reasoning.
I really hope that this trend stops at some point, because – while generative AI is absolutely useful for many things – generative AI does not work for everything. And if you already had this thought a few months ago and let ChatGPT classify URLs without double-checking, I may have bad news for you.
Following an intuition, I asked my local Llama v3.2 3B instruct model whether the eight examples above are valid URLs, and it failed spectacularly. It did not make all that many errors, but the way it failed reveals that generative AI has been trained to apply human-like reasoning, not machine-like reasoning. Below, I use `Comment:` to explain why the model likely generated its response.
Prompt: Is "google.com" a valid URL?
Response: Yes, "google.com" is a valid URL (Uniform Resource Locator). It's the top-level domain for Google's website.
Comment: This shows how the model has applied human-like reasoning, and we would apply the exact same reasoning, even though it is entirely wrong in this case.
Prompt: Is "https://www.bsky.app" a valid URL?
Response: Yes, "https://www.bsky.app" is a valid URL. The use of HTTPS (Hypertext Transfer Protocol Secure) and the ".app" top-level domain indicate that it's a secure website.
Comment: I love how the model assumed that, because it’s a `.app` domain, it’s more secure.
Prompt: Is "file:///Users/name/Documents/Essay.docx" a valid URL?
Response: No, "file:///Users/name/Documents/Essay.docx" is not a valid URL. It appears to be an absolute path on a local file system (using the Mac OS X syntax), rather than a networked resource that can be accessed over HTTP or HTTPS.
Comment: This was an interesting error, since the URL was fully qualified, but I think if one is not well-versed in the file protocol, one might’ve come to the same conclusion. Interestingly, the bigger error here was that it’s not “Mac OS X syntax,” but rather “UNIX syntax.”
Prompt: Is "https://dx.doi.org/10.1016/j.poetic.2009.11.004" a valid URL?
Response: Yes, "https://dx.doi.org/10.1016/j.poetic.2009.11.004" is a valid URL. The .doi extension indicates that it's a Digital Object Identifier (DOI), which is a standardized way to identify and cite digital objects, such as academic papers or articles.
Prompt: Is "mailto:john.doe@example.com" a valid URL?
Response: Yes, "mailto:john.doe@example.com" is a valid URL. It's an email address in the mailto protocol, which allows users to send emails directly from their web browser or other applications that support this scheme.
Prompt: Is "tel:+1234567890" a valid URL?
Response: Yes, "tel:+1234567890" is a valid URL. It's an email address in the telnet protocol, which allows users to make phone calls directly from their web browser or other applications that support this scheme.
Prompt: Is "doi:10.1016/j.poetic.2009.11.004" a valid URL?
Response: No, "doi:10.1016/j.poetic.2009.11.004" is not a valid URL in the classical sense.
The DOI (Digital Object Identifier) scheme is typically used to identify and cite digital objects, but it's not an HTTP or HTTPS protocol that can be accessed directly over the web.
However, some websites may use DOIs as part of their URLs, often with additional text like "doi.org" or "dx.doi.org". In this case, a valid URL would typically include both the DOI and one of these domain names.
Comment: This was an interesting response. It is, in fact, a valid URL, but the model generated an accurate description that it is not a valid URL in the classical sense – the DOI scheme has not yet been ratified by the IETF, and even though one can resolve DOIs via the DOI resolver unambiguously, browsers wouldn’t be able to open a DOI-URL.
Prompt: Is "did:plc:z72i7hdynmk6r22z27h6tvur" a valid URL?
Response: Yes, "did:plc:z72i7hdynmk6r22z27h6tvur" is a valid URL. The DID (Decentralized Identifiers) scheme is an emerging standard for unique identifiers that can be used to identify individuals or organizations in decentralized networks.
The format you provided, starting with "did:", followed by the issuer and identifier, is consistent with the DID specification.
Comment: This was also interesting. It is a decentralized identifier, as the model correctly replied, but it is not, in fact, a valid URL (in the classical sense), and should rather be classified as a URN.
Now, if you think to yourself, “Well, but ChatGPT is certainly better!” – I gave it a try, too. It did correctly say that the `file` URL was valid, which Llama 3.2 did not. However, it again stated that the `doi:` URL was invalid, which is wrong. The `did:` URL, on the other hand, was, according to ChatGPT, valid.
So … what happened here? I think it’s pretty obvious. Both models applied a reasoning that the average human would also do. But they – as most others – made three key mistakes:
- They assumed URLs must point to something that exists. However, that is nowhere mentioned in the specification, and it would be unreasonable. A URL is a uniform resource locator, meaning it provides something that helps locate a resource. But there is always interpretation involved. For example, when you type in `https://www.google.com`, this is a valid locator, but there must also be a web server that double-checks whether this resource actually exists. If validity required existence, there would be no 404 errors anymore.
- They assumed that, for a URL to be valid, it must use an established protocol, but this is wrong. Even though the `doi:` scheme is not yet standardized, it is a valid URL, as the specification does not say anything about which schemes are allowed. The concept of “deep links” illustrates this well: Developers can register a custom protocol (e.g., `zettlr://`) to allow for deep linking into apps and performing actions remotely, which enables automation mostly on smartphones, but also on the computer.
- They confused URNs with URLs. The `doi:` thing I provided is a locator, whereas the `did:` thing follows the broader URN standard and is not, strictly speaking, a URL.
In short, if you ever used generative AI to perform such a simple task, you probably got it wrong.
Final Thoughts
I believe that URLs are some form of litmus test that helps highlight that usually we don’t need generative AI to solve problems. It seems that ever since the inception of ChatGPT, people have stopped trying to think about fast but smart solutions to simple problems, and instead defaulted to suggesting generative AI for everything.
But by using generative AI, as we have seen, we not only receive wrong answers; we also overlook the very smart and almost magical engineering that goes into turning something as abstract as “abc.xyz” into a valid URL. Using LLMs does not challenge our thinking and does not help us understand the world around us. It also uses exorbitant resources in situations where a few simple `if`/`else` conditions suffice.
There is a reason why the `gzip` compression algorithm beats Large Language Models in some respects. I implore you to go look for the small but awesome contributions that our everyday software makes to our lives. How about understanding two-factor authentication, for example?
In any case, I hope that you could learn a thing or two today. Next time you mindlessly hack some letters into your browser’s address bar, think about how many computers have to work in tandem to deliver you exactly the website you were looking for – almost always.
1 Note that, at some point, the website may actually perform this link normalization. So if the link works as expected instead of leading you to a 404, thank the developers for their cool new feature!
2 There are entire jokes about how difficult it is to detect whether something is an email address. The lazy solution is to simply check for the presence of an `@`, but then there are URLs that are not emails which contain an `@`, such as `https://user:password@localhost/path/to/file`. Then, one can use regular expressions, but this runs the risk of also classifying invalid emails such as `.johndoe@examplecom.` as emails; in fact, it’s a big rabbit hole in itself. This whole “How to detect an email”-problem is a sub-issue of the problem I am describing in this article.
3 I’m saying “garbage” here cheekily, while including me in the group of addressees. I expect the same as you, but, from a purely technical standpoint, what we collectively provide to software and expect it to magically find a “link” to is objectively garbage. And because I have the misfortune to now belong to the group of people who have to write algorithms to help software do exactly this magic work, I reserve the right to be moderately grumpy about this fact.
4 Do you know what the URL specification has to say about network share URLs? “A relative reference that begins with two slash characters is termed a network-path reference; such references are rarely used.” (RFC 3986, § 4.2)