“Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.” (1)
Recently Chris Anderson wrote an article for Wired magazine called The End of Theory. The thesis of the article, in a nutshell, is that the impending petabyte era of data storage signals the end of the traditional scientific method of discovery. No longer are we bound to the outdated model of observation, hypothesis, and measurement. Computers (developed by Google and IBM) “can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
Sounds pretty cool, and logical. Our human brains can’t see patterns in datasets of petabyte size, and more importantly we don’t need to: we can let the computers do it for us. We can skip the why and get straight to the what.
There is a fascinating discussion of Anderson’s thesis on a website called The Edge. The most interesting points I found were the following (I am paraphrasing a lot of really smart people here, so bear with me):
1. There is nothing new about using correlation to make predictive statements.
2. Even with computers, bias is inescapable e.g. “Data are an artifact of selection, which means they reflect an underlying hypothesis or they wouldn’t have been collected.” (2)
3. Even with exabytes of data, correlations in themselves do not answer the most ancient of human questions: why?
Ok, so at this point you might be asking yourself what any of this has to do with taxonomy. I promise I will get there.
My experience as a consultant has thrown me into some pretty large companies. What I have noticed is that some people in these companies who are attempting to manage huge amounts of data are already unknowingly subscribing to Chris Anderson’s theory.
If in your own work you haven’t heard the phrase “let’s just get Google” when talk of information management strategy and taxonomy comes up… well, let me assure you it’s just a matter of time. I used to think it was just a brand-awareness thing, or a simplicity one-box thing, but Anderson’s article and the ensuing discussion are making me think a little differently.
Google works really well and we love it. Interestingly, it works on the principle that Anderson heralds as the new science: lots of people either go to, or link to, a specific website, therefore it is the most relevant; correlations in the vast www. Now obviously the algorithm is more complicated than that, but that’s the gist of it. Let people’s behaviour dictate the value of information; we don’t need to categorize it ourselves and we don’t need to know why it is important.
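The link-popularity principle is easy to sketch. Here is a minimal, hypothetical power-iteration in the spirit of PageRank, over an invented three-page web; the real algorithm is, as noted, far more complicated, and the page names and link structure here are made up purely for illustration:

```python
# A toy, PageRank-style power iteration (illustrative only; the real
# Google algorithm is far more complex). The pages and links are invented.
links = {
    "a": ["b", "c"],  # page "a" links to "b" and "c"
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}  # start with equal rank

for _ in range(50):  # iterate until the ranks settle
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share  # each link passes on a share of rank
    rank = new_rank

# Page "c" ends up most "relevant" simply because more pages link to it.
top = max(rank, key=rank.get)
```

No one tells the algorithm why “c” matters; its rank emerges purely from the correlation of links pointing at it, which is exactly the behaviour-over-understanding bargain described above.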
I will admit that there is a comfort in knowing that when you type in a search you will get back results that reflect the global popularity contest, and I think, probably, even if it is only subconsciously, many people would like to feel that same underlying sense inside the enterprise: “I don’t care about the why, just show me the what… and more importantly, show me the what that everyone else thinks is great.”
Let’s step back for a second now. Taxonomy is all about the why. On the surface one might argue differently, “it’s about what goes where”, but when it comes right down to it, the real value in taxonomy comes from understanding why the what goes where.
If that sounds too theoretical, consider that what I have just argued is one of the most common best practices in good taxonomy development, i.e. it must be driven by the needs and mental models of actual users.
So what are the implications of the petabyte era for taxonomy development?
Well, in relation to the first point I pulled from the discussion of Anderson’s article (using correlation to make predictive statements is nothing new), the same is true for taxonomy development.
The technology to “auto-generate” a taxonomy from a corpus of documents has been in the marketplace for years now. These technologies use entity extraction, term frequency, and other statistical techniques to construct ready-to-use taxonomies… sort of. Anyone who has used these tools knows that, while they can be a useful starting point, the pure muscle of the statistical and vector-analysis techniques these products use will only, at best, reflect the bias of the data.
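To make the limitation concrete, here is a deliberately naive sketch of the term-frequency side of such a tool, using a tiny invented corpus (the documents and stopword list are assumptions for illustration, not any vendor’s actual method):

```python
# Naive sketch of statistical taxonomy-candidate generation by raw term
# frequency. The corpus and stopword list are invented for illustration.
from collections import Counter
import re

corpus = [
    "invoice payment terms and invoice approval workflow",
    "payment processing and vendor payment schedule",
    "vendor onboarding checklist",
]

stopwords = {"and", "the", "of", "a", "to"}
counts = Counter()
for doc in corpus:
    for term in re.findall(r"[a-z]+", doc.lower()):
        if term not in stopwords:
            counts[term] += 1  # tally how often each term appears

# The most frequent terms become candidate taxonomy nodes. Note they
# reflect only what this corpus happens to contain -- its bias -- and
# say nothing about what users actually need.
candidates = [term for term, _ in counts.most_common(3)]
```

Frequency surfaces “payment”, “invoice”, and “vendor” as candidates here, which is a reasonable starting point, but the corpus alone cannot say whether users think of these as siblings, parent and child, or something else entirely.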
This brings me to the second point I mentioned, i.e. bias is inescapable. Unlike hard science, however, taxonomy development demands bias. As I mentioned above, good taxonomy development involves more than just the content. Depending on the intended function of the taxonomy (navigation, classification, search), it will involve crafting the taxonomy structure and heuristic principles to align with user needs. These needs cannot be extracted through correlation and analytic algorithms.
Which brings us right back to the why of the matter. Taxonomy is about understanding why content is similar and dissimilar, why concepts are related, and the implications of semantics across the enterprise. Understanding why allows us to inform our information systems proactively, rather than waiting for an algorithm to tell us which content is most popular. Sure, it’s nice to know what everyone else is looking at, but the value of information cannot always be determined by its popularity.