Organizing information: from rigid structures to the lack of information organization

Arseni Mourzenko
Founder and lead developer
April 29, 2015
Tags: tagging, data-structures

When organizing information such as files, two problems arise every time: the uncertainty about whether the current structure is appropriate, and the cases that fall outside the structure.

Those two problems are closely related and lead to similar results, such as the “Miscellaneous” directory, which contains everything that didn't find its place elsewhere. Some of those entries are truly outside any conceivable structure, given their specific nature, their particular content, or their origin; but when the initial structure is badly designed, perfectly ordinary files can end up in the “Miscellaneous” directory as well.

There are numerous examples of that. For example, when sorting photos, one can try to group them by events, such as “Trip to London” and “Visit to the Louvre museum.” Quickly, the person will find that there are many photos which, while valuable, are outside any particular event, such as a single photo of a strange bird the person noticed on their balcony the other day. The person can instead try to group the photos by date: those were taken on the 15th of July, while these were taken on the 18th of July. Such an organization is not only of little use, it also creates distinctions where none should exist. For example, it doesn't make sense to put the photos of a celebration which started in the evening and continued through the night in two directories.

Bad structures always lead to a lot of entries which are put into the “Miscellaneous” directory, or into a directory where they feel out of place. Tree-based structures lead to a high number of such cases nearly every time (actually, I've never seen any case where a tree-based structure would work well to sort uniform resources). This is caused by the inherently mutually exclusive nature the tree structure imposes: the fact that a leaf or a branch belongs to one and only one branch is a major obstacle for the organization of information. The worst example I have in mind is the password manager we use internally, where most passwords ended up in the “Miscellaneous” folder, but every other usage of a tree was harmful too.

Tagging helps. By removing the mutually exclusive nature of a tree, tagging solves the biggest problem of a tree: the situation where a leaf may find its place in several branches at once. Actually, there are two underlying cases which are problematic with a tree: the case where a leaf could be placed in any of several branches, none being more appropriate than the others, and the case where a leaf should be in several branches at once.
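The difference between the two models can be sketched in a few lines of Python. This is only an illustration under assumed data: the file names, the tag names, and the helpers `build_tag_index` and `find` are hypothetical and do not come from any particular tool.

```python
from collections import defaultdict

# Hypothetical photo names and labels, chosen only for illustration.

# Tree: every entry has exactly one parent directory.
tree = {
    "big_ben.jpg": "Trip to London",
    "mona_lisa.jpg": "Visit to the Louvre museum",
    "strange_bird.jpg": "Miscellaneous",  # fits no event, so it lands here
}

# Tags: every entry maps to a set of labels, so membership is not exclusive.
tags = {
    "big_ben.jpg": {"london", "trip", "architecture"},
    "mona_lisa.jpg": {"paris", "louvre", "trip", "art"},
    "strange_bird.jpg": {"bird", "balcony"},
}


def build_tag_index(tags):
    """Invert the entry -> labels mapping into label -> entries."""
    index = defaultdict(set)
    for entry, labels in tags.items():
        for label in labels:
            index[label].add(entry)
    return index


def find(index, *labels):
    """Return the entries that carry every one of the given labels."""
    sets = [index.get(label, set()) for label in labels]
    return set.intersection(*sets) if sets else set()


index = build_tag_index(tags)
print(find(index, "trip"))         # both trip photos, whatever the event
print(find(index, "trip", "art"))  # only the Louvre photo
```

In the tree, the bird photo has nowhere to go except “Miscellaneous”, while in the tag model it simply carries the labels that describe it, and a photo can appear under as many labels as needed.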

Unfortunately, tagging doesn't solve all the problems. There are still entries which are completely outside the system. For instance, this blog article, by its nature, couldn't find its way into the preexisting set of tags. By creating an additional tag, I hid the problem, but the fact remains that the article is too different to fit into tags intended for software production articles.

When an entry is too special, either the entry should be reshaped so that it fits into the preexisting system, or the system itself should change.

This leads to the primary recommendation of this article: always design both the organization structure and the structure framework so that changes are as painless as possible. This means that:

Ultimately, users shouldn't even be aware of the way information is organized. It should simply work, auto-magically. A promising field in this domain is full-text search and search based on metadata. I've already discussed how search capability is superior to tagging in the case of this blog, and the case study applies just as well to music hosting websites or password managers. It is not that surprising that basic indexing engines coupled with highly complex search systems produce better search capability than manual tagging (this is explained mostly by the inability of the human brain to categorize entries in an obvious way that makes sense to everyone). What is more surprising is that even elementary full-text or metadata-based search (with basic indexing) is still more capable than tagging or categorizing within a tree. Less work leads to better results: not bad.
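To give an idea of what "elementary full-text search with basic indexing" can mean, here is a minimal sketch of an inverted index in Python. It is an assumption about the simplest workable form, not a description of the search used on this blog; the sample entries and the function names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical password-manager entries: a title plus free-form notes.
documents = {
    1: "Staging database admin account for the billing service",
    2: "Wi-Fi password for the office guest network",
    3: "Billing provider API key, production",
}
index = build_index(documents)
print(search(index, "billing"))        # entries 1 and 3
print(search(index, "guest network"))  # entry 2
```

Nobody had to decide in advance whether entry 3 belongs under "Billing", "API keys", or "Production": the words already present in the entry are enough to find it from any of those angles.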