Correlation tags

Arseni Mourzenko
Founder and lead developer
177
articles
August 4, 2022
Tags: logging 1 tagging 4

For quite a long time, cor­re­la­tion IDs were used to map the log en­tries to a giv­en HTTP re­quest. The idea is sim­ple: every re­quest com­ing to a web ap­pli­ca­tion is be­ing as­signed a pseu­do-ran­dom piece of text, usu­al­ly a UUID, and this text is in­clud­ed in the logs. It can then be used to group the log en­tries by re­quests, or to iso­late the log en­tries cor­re­spond­ing to a very spe­cif­ic re­quest—say a one which took much longer than ex­pect­ed, or re­sult­ed in an HTTP 500.

While use­ful, the ap­proach has its lim­i­ta­tions. I first dis­cov­ered its ma­jor lim­i­ta­tion when try­ing to track the same re­quest through a chain of API calls, that is, when an API calls an­oth­er one, which po­ten­tial­ly per­forms the calls to oth­er ones. I want­ed to have an overview of every­thing that hap­pened for a giv­en “root” re­quest, in­de­pen­dent­ly of the APIs be­ing called in chain, but they weren't shar­ing any com­mon en­try. The only way, quite painful, ac­tu­al­ly, was to map the re­quests per­formed to the re­quests re­ceived, with the help of a time­stamp—a task which can eas­i­ly go wrong.

I fig­ured out that I could use mul­ti­ple cor­re­la­tion IDs. The orig­i­nal client re­quest would make the API gen­er­ate the first cor­re­la­tion ID. Every HTTP re­quest to oth­er APIs per­formed by the first API would in­clude this orig­i­nal cor­re­la­tion ID; the APIs re­ceiv­ing such re­quest would keep note of this cor­re­la­tion ID, as well as gen­er­ate their own. This way, one would end up with some sort of a tree, where every log en­try would con­tain a trace of every re­quest which led to this one. The prob­lem was, how­ev­er, the place. I mean, a log en­try which starts with five UUIDs in a row is sim­ply un­read­able. There­fore, I nev­er im­ple­ment­ed the ac­tu­al sys­tem back then.

More re­cent­ly, I was work­ing on a log­ging sys­tem where the serv­er logs would be streamed to the client and shown in the con­sole—a very con­ve­nient way to de­bug a giv­en ac­tion. Es­sen­tial­ly, I would, say, click on a but­ton in the web in­ter­face, and look at the logs, both the ones gen­er­at­ed by the client ap­pli­ca­tion it­self, as well as the ones streamed from the APIs which were called by the client ap­pli­ca­tion. The fea­ture ap­peared to be rather con­ve­nient, but had a huge draw­back—it wasn't only show­ing my logs, that is the logs gen­er­at­ed by the ac­tion I was per­form­ing, but rather all the logs from all the ac­tions by all the users. I wasn't in­ter­est­ed by that, and need­ed to find a way to fil­ter out the ac­tions, but there was no easy way to do it. The only way was to match the cor­re­la­tion IDs to the orig­i­nal logged re­quest, check the IP ad­dress and match it to mine. Rather con­vo­lut­ed way to do the thing.

It came to my mind that I could de­vel­op a sys­tem which would be much more pow­er­ful and flex­i­ble than cor­re­la­tion IDs—a sys­tem of re­quest tag­ging. Here it is.

Short tags vs. long IDs

I'm un­sure how UUIDs be­came a de-fac­to stan­dard for cor­re­la­tion IDs. This is by far, af­ter IPv6, the most un­read­able type out there, and I just don't think that I can re­mem­ber that I'm look­ing at re­quest d5d82581-f557-4bc9-8439-16e47e241d2a, nor do I want some­one to spell it out to me by phone when I'm di­ag­nos­ing a pro­duc­tion is­sue.

The ben­e­fit of UUIDs is that they are unique, glob­al­ly. Cor­re­la­tion IDs don't need to be glob­al­ly unique. In fact, every log en­try has a time­stamp and a con­text—a giv­en serv­er ap­pli­ca­tion, which means that the cor­re­la­tion ID should be unique with­in a giv­en time range and a serv­er ap­pli­ca­tion. If I see the ex­act same ID ap­pear­ing in the logs one hour lat­er, it doesn't re­al­ly mat­ter. Agreed, it could in­crease a risk of a hu­man er­ror, but so do the UUIDs. Frankly, when I work with UUIDs, I sim­ply look at the very first char­ac­ters. If they re­peat, I may con­found them.

A bet­ter al­ter­na­tive is there­fore to save space and to use a string con­tain­ing five to eight pseu­do-ran­dom char­ac­ters, like F38cX, 40sUY, and hPhS1. To re­duce the risk of con­fu­sion, some char­ac­ters such as low­er­case «L» or cap­i­tal «O» should be pro­hib­it­ed. In oth­er words, any sys­tem deal­ing with those tags should as­sume, for in­stance, that O is ac­tu­al­ly a 0 when it comes from user in­put.

Mul­ti­plic­i­ty of tags

The ma­jor ben­e­fit com­pared to cor­re­la­tion IDs is nat­u­ral­ly the pos­si­bil­i­ty to com­bine mul­ti­ple tags. That is, one tag is gen­er­at­ed by the serv­er when re­ceiv­ing an HTTP re­quest, but then, all the tags in­clud­ed in the X-Correlation-Tags HTTP head­er would be in­clud­ed as well in the logs. This way, a client can pro­vide its own cor­re­la­tion tag, or even mul­ti­ple cor­re­la­tion tags.

Imag­ine a client ap­pli­ca­tion which, at some point, per­forms a search for some­thing, what­ev­er that is. It will ask the API to tag the logs with 9bnKp. The search API re­ceiv­ing this re­quest would gen­er­ate its own tag, cLYNz. To dec­o­rate the search­es with the in­for­ma­tion about the users, the search API will con­tact the per­sons API, in­clud­ing two X-Correlation-Tags head­ers for two cor­re­la­tion tags. Per­sons API would also gen­er­ate its prop­er re­quest tag, 02dgd. Its logs would there­fore con­tain three tags: [02dgd, cLYNz, 9bnKp]. This way, it would ap­pear when check­ing all the logs which were caused by the orig­i­nal re­quest that was tagged 9bnKp by the client.

Cus­tom tag­ging

As the cor­re­la­tion tags can be sent by the client, this mech­a­nism can be used to tag the logs of en­tire fea­tures, rather than re­quests. Or it can also be used to tag de­bug­ging ses­sions.

Say I want to know what hap­pens when I per­form a giv­en ac­tion in the brows­er. I don't care about the ac­tu­al HTTP re­quests—what I want to know is every­thing that hap­pens when I do this or that.

In or­der to do that, the client ap­pli­ca­tion can give me a way to spec­i­fy a cus­tom tag to in­clude in every re­quest, say AM001—“am” are my ini­tials, and “001” is the de­bug­ging ses­sion 1.

Now, every log en­try would con­tain my cus­tom tag. So an API deep down the stack would log an en­try, spec­i­fy­ing the tags [u7tYg, MQfk2, JjsEa, AM001]. Fil­ter­ing those logs would be ex­treme­ly easy.

Ex­am­ple of a log en­try

Here's an ex­am­ple of a log as it would be shown in the brows­er con­sole. Since mod­ern browsers make it pos­si­ble to styl­ize the text shown in the con­sole, the client can take a list of struc­tured log en­tries, and show it like this.

15:27:43 Verb. AM001 front:bookSearch searching for “Dickens.”
15:27:43 Inf.  AM001 vRCH6 front:fetch performing GET https://bookshelf-api.example.com/search?q=Dickens.
15:27:46 Inf.  AM001 vRCH6 nug8u bookshelf-api1:blueprints/book:192 received GET /search?q=Dickens from 192.168.1.58.
15:27:46 Verb. AM001 vRCH6 nug8u bookshelf-api1:data/cache:25 cache missing for “all-books-of-user(9228900519362).”
15:27:46 Inf.  AM001 vRCH6 nug8u bookshelf-api1:data/mongo:60 retrieving the list of all books.
15:27:46 Inf.  AM001 vRCH6 nug8u bookshelf-api1:data/mongo:60 got 1522 book(s) in 27 ms.
15:27:44 Inf.  AM001 vRCH6 front:fetch GET https://bookshelf-api.example.com/search?q=Dickens ended with HTTP 200 in 195 ms.
15:27:44 Verb. AM001 front:bookSearch received 8 result(s).
15:27:46 Inf.  Wk15L naJRD bookshelf-api1:blueprints/book:88 received POST https://bookshelf-api.example.com/read/912 from 107.191.46.31.
15:27:46 Verb. AM001 vRCH6 nug8u bookshelf-api1:data/cache:25 adding cache entry with key “all-books-of-user(9228900519362),” no expiry date.
15:27:46 Inf.  AM001 vRCH6 nug8u bookshelf-api1:blueprints/book:192 processing ended in 188 ms.; responding with HTTP 200.

No­tice, to­wards the end, a log en­try re­lat­ed to an­oth­er re­quest by a dif­fer­ent client. As it doesn't have the or­ange tag, it can be ig­nored—or the client ap­pli­ca­tion which shows the ac­tu­al logs can dis­card it au­to­mat­i­cal­ly if asked to. Also no­tice that the mes­sages are not nec­es­sar­i­ly in or­der, nor are their time­stamps. As the logs from the API take time to ar­rive to the client (through Serv­er-sent events or Web­Sock­ets), they are in­deed of­ten shown af­ter the log mes­sages of the client it­self—that's the draw­back of asyn­chro­nous pro­cess­ing of those mes­sages. And as the time is not nec­es­sar­i­ly syn­chro­nized be­tween the client and the serv­er, the time­stamps may dif­fer as well.