For quite a long time, correlation IDs were used to map the log entries to a given HTTP request. The idea is simple: every request coming to a web application is being assigned a pseudo-random piece of text, usually a UUID, and this text is included in the logs. It can then be used to group the log entries by requests, or to isolate the log entries corresponding to a very specific request—say a one which took much longer than expected, or resulted in an HTTP 500.
While useful, the approach has its limitations. I first discovered its major limitation when trying to track the same request through a chain of API calls, that is, when an API calls another one, which potentially performs the calls to other ones. I wanted to have an overview of everything that happened for a given “root” request, independently of the APIs being called in chain, but they weren't sharing any common entry. The only way, quite painful, actually, was to map the requests performed to the requests received, with the help of a timestamp—a task which can easily go wrong.
I figured out that I could use multiple correlation IDs. The original client request would make the API generate the first correlation ID. Every HTTP request to other APIs performed by the first API would include this original correlation ID; the APIs receiving such request would keep note of this correlation ID, as well as generate their own. This way, one would end up with some sort of a tree, where every log entry would contain a trace of every request which led to this one. The problem was, however, the place. I mean, a log entry which starts with five UUIDs in a row is simply unreadable. Therefore, I never implemented the actual system back then.
More recently, I was working on a logging system where the server logs would be streamed to the client and shown in the console—a very convenient way to debug a given action. Essentially, I would, say, click on a button in the web interface, and look at the logs, both the ones generated by the client application itself, as well as the ones streamed from the APIs which were called by the client application. The feature appeared to be rather convenient, but had a huge drawback—it wasn't only showing my logs, that is the logs generated by the action I was performing, but rather all the logs from all the actions by all the users. I wasn't interested by that, and needed to find a way to filter out the actions, but there was no easy way to do it. The only way was to match the correlation IDs to the original logged request, check the IP address and match it to mine. Rather convoluted way to do the thing.
It came to my mind that I could develop a system which would be much more powerful and flexible than correlation IDs—a system of request tagging. Here it is.
Short tags vs. long IDs
I'm unsure how UUIDs became a de-facto standard for correlation IDs. This is by far, after IPv6, the most unreadable type out there, and I just don't think that I can remember that I'm looking at request d5d82581-f557-4bc9-8439-16e47e241d2a, nor do I want someone to spell it out to me by phone when I'm diagnosing a production issue.
The benefit of UUIDs is that they are unique, globally. Correlation IDs don't need to be globally unique. In fact, every log entry has a timestamp and a context—a given server application, which means that the correlation ID should be unique within a given time range and a server application. If I see the exact same ID appearing in the logs one hour later, it doesn't really matter. Agreed, it could increase a risk of a human error, but so do the UUIDs. Frankly, when I work with UUIDs, I simply look at the very first characters. If they repeat, I may confound them.
A better alternative is therefore to save space and to use a string containing five to eight pseudo-random characters, like
hPhS1. To reduce the risk of confusion, some characters such as lowercase «L» or capital «O» should be prohibited. In other words, any system dealing with those tags should assume, for instance, that
O is actually a
0 when it comes from user input.
Multiplicity of tags
The major benefit compared to correlation IDs is naturally the possibility to combine multiple tags. That is, one tag is generated by the server when receiving an HTTP request, but then, all the tags included in the
X-Correlation-Tags HTTP header would be included as well in the logs. This way, a client can provide its own correlation tag, or even multiple correlation tags.
Imagine a client application which, at some point, performs a search for something, whatever that is. It will ask the API to tag the logs with
9bnKp. The search API receiving this request would generate its own tag,
cLYNz. To decorate the searches with the information about the users, the search API will contact the persons API, including two
X-Correlation-Tags headers for two correlation tags. Persons API would also generate its proper request tag,
02dgd. Its logs would therefore contain three tags:
[02dgd, cLYNz, 9bnKp]. This way, it would appear when checking all the logs which were caused by the original request that was tagged
9bnKp by the client.
As the correlation tags can be sent by the client, this mechanism can be used to tag the logs of entire features, rather than requests. Or it can also be used to tag debugging sessions.
Say I want to know what happens when I perform a given action in the browser. I don't care about the actual HTTP requests—what I want to know is everything that happens when I do this or that.
In order to do that, the client application can give me a way to specify a custom tag to include in every request, say
AM001—“am” are my initials, and “001” is the debugging session 1.
Now, every log entry would contain my custom tag. So an API deep down the stack would log an entry, specifying the tags
[u7tYg, MQfk2, JjsEa, AM001]. Filtering those logs would be extremely easy.
Example of a log entry
Here's an example of a log as it would be shown in the browser console. Since modern browsers make it possible to stylize the text shown in the console, the client can take a list of structured log entries, and show it like this.
15:27:43 Verb. AM001 front:bookSearch searching for “Dickens.”15:27:43 Inf. AM001 vRCH6 front:fetch performing GET https://bookshelf-api.example.com/search?q=Dickens.15:27:46 Inf. AM001 vRCH6 nug8u bookshelf-api1:blueprints/book:192 received GET /search?q=Dickens from 192.168.1.58.15:27:46 Verb. AM001 vRCH6 nug8u bookshelf-api1:data/cache:25 cache missing for “all-books-of-user(9228900519362).”15:27:46 Inf. AM001 vRCH6 nug8u bookshelf-api1:data/mongo:60 retrieving the list of all books.15:27:46 Inf. AM001 vRCH6 nug8u bookshelf-api1:data/mongo:60 got 1522 book(s) in 27 ms.15:27:44 Inf. AM001 vRCH6 front:fetch GET https://bookshelf-api.example.com/search?q=Dickens ended with HTTP 200 in 195 ms.15:27:44 Verb. AM001 front:bookSearch received 8 result(s).15:27:46 Inf. Wk15L naJRD bookshelf-api1:blueprints/book:88 received POST https://bookshelf-api.example.com/read/912 from 184.108.40.206.15:27:46 Verb. AM001 vRCH6 nug8u bookshelf-api1:data/cache:25 adding cache entry with key “all-books-of-user(9228900519362),” no expiry date.15:27:46 Inf. AM001 vRCH6 nug8u bookshelf-api1:blueprints/book:192 processing ended in 188 ms.; responding with HTTP 200.
Notice, towards the end, a log entry related to another request by a different client. As it doesn't have the orange tag, it can be ignored—or the client application which shows the actual logs can discard it automatically if asked to. Also notice that the messages are not necessarily in order, nor are their timestamps. As the logs from the API take time to arrive to the client (through Server-sent events or WebSockets), they are indeed often shown after the log messages of the client itself—that's the drawback of asynchronous processing of those messages. And as the time is not necessarily synchronized between the client and the server, the timestamps may differ as well.