Integrating Taxonomy with CMS – Book Chapter in Publication

The final draft has been submitted… Mark your calendars…

The Information Management Best Practices 2009 book is going to publication this week, in hopes of being ready for launch at the J.Boye Conference in Aarhus, Denmark, Nov 2-4. I’ll be there, giving a talk on SharePoint IA, but also to lend a hand with the book launch activities.

I’m proud to have a chapter in this book, with co-authors Seth Earley and Charlie Gray (CMS & Taxonomy Strategist, Motorola), on one of our most in-depth and successful projects – integrating taxonomy with CMS at Motorola. The best practice covers the steps below in great detail, offering practical advice and screenshots from the actual implementation at Motorola.


  • Step 1: Educate Stakeholders on Taxonomy
  • Step 2: Bring a Taxonomy Expert onto your CMS Implementation Team
  • Step 3: Determine Functional Requirements

Continue reading

Social Tagging – Questions Answered on Correction Tools and Vendors

A few weeks ago, I had the pleasure of giving a presentation on taxonomy vs. folksonomy in the enterprise to the Deloitte Social Tagging & Taxonomy Community of Practice, thanks to an invitation by fellow taxonomy enthusiasts Annie Wang and Lee Romero.

It was a fun presentation (a variation on this talk) and the audience asked some great questions afterwards. I was only able to answer a couple of questions before time ran out, so I offered to answer the rest on my blog. Here are the additional questions & answers:

1. Are there tools for auto-correcting social tags?

I had mentioned the idea that folksonomies are considered to be “self-correcting” or self-tuning – through volume of tags and users, anomalies (like single-use tags, misspellings, etc.) tend to be pushed to the side and the majority will trend towards correct/useful tags. This is an idea that I picked up from a whitepaper on social tagging by Oracle:

All social input strategies rely on the good-graces of well-intentioned users habituated to provide input over time to succeed…  Social strategies will self-correct for this problem over time under the presumption that more users than not will provide “good” information.

While this is the case on the web, where there are millions of users and tags, it will not likely occur as easily or quickly in the reduced scope of the enterprise, where you have a tiny fraction of this volume. So the question asks whether there are tools available to help encourage good tags by auto-correcting things like spelling mistakes, plural forms, etc.

The short answer is… not really.
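That said, the basic normalization such a tool would perform is easy to sketch. Below is a minimal, hypothetical example (not a real product) that folds case, simple English plurals, and near-miss spellings into an existing set of accepted tags, using Python’s standard difflib:

```python
from collections import Counter
from difflib import get_close_matches

def normalize_tag(raw_tag, accepted_tags):
    """Fold case, simple plurals, and near-miss spellings into an
    accepted canonical tag; otherwise return the cleaned tag as-is."""
    tag = raw_tag.strip().lower()
    # Fold simple English plurals into an existing singular form.
    if tag.endswith("ies") and tag[:-3] + "y" in accepted_tags:
        tag = tag[:-3] + "y"
    elif tag.endswith("s") and tag[:-1] in accepted_tags:
        tag = tag[:-1]
    # Snap likely misspellings to the closest accepted tag.
    if tag not in accepted_tags:
        matches = get_close_matches(tag, accepted_tags, n=1, cutoff=0.85)
        if matches:
            tag = matches[0]
    return tag

accepted = {"taxonomy", "folksonomy", "metadata"}
raw = ["Taxonomy", "taxonomies", "folksonomy", "metadta", "metadata", "sharepoint"]
cleaned = Counter(normalize_tag(t, accepted) for t in raw)
```

Note that “sharepoint” survives untouched – a correction tool should only fold obvious variants, never discard genuinely new tags, or the folksonomy loses the emergent vocabulary that makes it valuable.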

Continue reading

OASIS Approves UIMA – the first standard for accessing Unstructured Information

Early last month, OASIS announced the approval of the Unstructured Information Management Architecture Version 1.0.  This standard creates an open method for accessing unstructured information – that is, any information that is created by and for people, and is not inherently machine-readable (e.g., not data).  UIMA can potentially become very important since it provides a standard mechanism to exchange metadata for all types of unstructured content – documents, web pages, email, voice, images and video.

As we all have heard repeated in the marketing messages of every content-related software company, over 80% of the data we run our businesses on is unstructured.  In our business we help our clients tame their mountains of content by classifying it.  Often we rely on technologies like auto-classification, entity extraction, and other analytics to tag content with metadata.  Metadata helps us bring structure – and in turn semantics or meaning – to unstructured content. 
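As a toy illustration of what such tagging does (the taxonomy terms and keyword lists below are invented, and real auto-classifiers rely on statistical or linguistic models rather than bare keyword matching), a rule-based classifier can be sketched in a few lines:

```python
# Hypothetical sketch: rule-based auto-classification that tags a piece
# of unstructured content with taxonomy terms based on keyword evidence.
TAXONOMY_RULES = {
    "Mobile Devices": ["handset", "smartphone", "mobile"],
    "Enterprise Search": ["search", "findability", "query"],
    "Content Management": ["cms", "workflow", "publishing"],
}

def classify(text):
    """Return the taxonomy terms whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        term for term, keywords in TAXONOMY_RULES.items()
        if any(kw in lowered for kw in keywords)
    )

doc = "The new CMS workflow improves findability of handset manuals."
tags = classify(doc)  # taxonomy terms assigned as metadata
```

The assigned terms become the metadata that brings structure, and in turn meaning, to the document.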

Of course, each of these systems has its own API and its own methods of expressing the metadata it produces or consumes.  This is where UIMA comes in.  In the introduction to the UIMA standard, the team at OASIS describes a typical workflow in which various analytics packages may need to interact:

An example of assigning semantics includes labeling regions of text in a text document with appropriate XML tags that identify the names of organizations or products. Another example may extract elements of a document and insert them in the appropriate fields of a relational database, or use them to create instances of concepts in a knowledgebase. Another example may analyze a voice stream and tag it with the information explicitly identifying the speaker, or identifying a person or a type of physical object in a series of video frames.

Analytics are typically reused and combined together in different flows to perform application-specific aggregate analyses. For example, in the analysis of a document the first analytic may simply identify and label the distinct tokens or words in the document. The next analytic might identify parts of speech, the third might use the output of the previous two to more accurately identify instances of persons, organizations and the relationships between them.
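This aggregate flow is easy to picture in code. The sketch below is loosely inspired by UIMA’s shared analysis structure (the CAS) but is not the UIMA API; each analytic reads the shared structure, adds stand-off annotations of the form (type, begin, end), and leaves everything else intact:

```python
# A loose sketch (not the UIMA API) of the aggregate-analysis idea.
def tokenizer(cas):
    """First analytic: label the distinct tokens in the text."""
    start = 0
    for word in cas["text"].split():
        begin = cas["text"].index(word, start)
        cas["annotations"].append(("Token", begin, begin + len(word)))
        start = begin + len(word)
    return cas

def org_spotter(cas):
    """Downstream analytic: reuses the tokens produced upstream. A real
    analytic would use a model or gazetteer; a toy list stands in here."""
    known_orgs = {"Motorola", "IBM", "OASIS"}
    for type_, begin, end in list(cas["annotations"]):
        if type_ == "Token" and cas["text"][begin:end] in known_orgs:
            cas["annotations"].append(("Organization", begin, end))
    return cas

def run_pipeline(text, analytics):
    """Chain independently developed analytics over one shared structure."""
    cas = {"text": text, "annotations": []}
    for analytic in analytics:
        cas = analytic(cas)
    return cas

result = run_pipeline("IBM and OASIS approved UIMA", [tokenizer, org_spotter])
```

Swapping in a different organization spotter, or inserting a part-of-speech analytic between the two, requires no change to the other components – which is exactly the interoperability the standard is after.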

UIMA is unique in that it enables the interoperability of analytics across platforms, frameworks, and content types (text, audio, video, etc.).  If various vendors opt to support UIMA, it would enable “best-of-breed” analytics workflows to be created by linking multiple vendors’ components into solutions.

The areas covered by UIMA are described in the standard as follows:

The UIMA specification defines platform-independent data representations and interfaces for text and multi-modal analytics. The principal objective of the UIMA specification is to support interoperability among analytics.  This objective is subdivided into the following four design goals:

  1. Data Representation. Support the common representation of artifacts and artifact metadata independently of artifact modality and domain model and in a way that is independent of the original representation of the artifact.
  2. Data Modeling and Interchange. Support the platform-independent interchange of analysis data (artifact and its metadata) in a form that facilitates a formal modeling approach and alignment with existing programming systems and standards.
  3. Discovery, Reuse and Composition. Support the discovery, reuse and composition of independently-developed analytics.
  4. Service-Level Interoperability. Support concrete interoperability of independently developed analytics based on a common service description and associated SOAP bindings.
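Design goals 1 and 2 amount to a modality-independent bundle of artifact plus stand-off metadata that any analytic can serialize and exchange. The field names below are invented for illustration – the standard itself defines XML-based (XMI) representations – but the shape of the data is the point:

```python
import json

# Hedged sketch of design goals 1-2: an artifact and its metadata in a
# common representation, serialized for platform-independent interchange.
analysis = {
    "artifact": {
        "modality": "text",
        "content": "Motorola announced a new handset.",
    },
    "metadata": [
        {"type": "Organization", "begin": 0, "end": 8},
        {"type": "Product", "begin": 25, "end": 32},
    ],
}

wire = json.dumps(analysis)   # interchange form sent between analytics
restored = json.loads(wire)   # consumed by an independently developed analytic
span = restored["artifact"]["content"][0:8]
```

Because the annotations point into the artifact by offset rather than marking it up inline, downstream analytics can add their own layers of metadata without ever modifying the original content.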

Here at Earley & Associates we will keep our eyes and ears tuned to understand which vendors will incorporate UIMA into their product offerings.  IBM and EMC are on the UIMA working group, and much of the project was based upon an open source project at IBM (IBM’s UIMA began as an alphaWorks project).  IBM already touts support for UIMA in several of its information management products, including eDiscovery Analyzer, Content Analyzer, and OmniFind.  There is an incubator project for UIMA-based open source software hosted at the Apache Software Foundation.  As part of the eligibility process for becoming an OASIS standard, successful use of UIMA was verified by IBM, Amsoft, Carnegie Mellon University, Thomson Reuters, and the University of Tokyo. 

We look forward to learning of other vendors who will adopt this standard – it will surely make our jobs easier, and those of others who must integrate systems that produce and consume metadata.

MOSS 2007 Requirements Gathering: Fast and Focused

Because Microsoft Office SharePoint Server is a mature platform for collaboration, content management and portals, companies often implement the package without much planning or even requirements gathering. Too often, the IT department is assigned the task of technically implementing SharePoint, with little context for its use or its potential value to the organization. The individuals in business units or departments who will use the system are kept in the dark about the plans and the functionality of SharePoint. Once IT is satisfied that MOSS is technically stable, it rolls the package out to users with little training or follow-up. This approach rarely succeeds.

In this post, I want to examine how to set the foundation for a successful SharePoint implementation by starting with a clear understanding of user requirements and the business results stakeholders want to achieve. Governance, Site construction, etc. can wait until there is a base level of understanding of the business objectives and user requirements. Continue reading

SharePoint 2007 – Implementing and Managing Taxonomy

We’ve been doing a lot of work with SharePoint lately so I thought I’d put together a quick post on some approaches to implementing taxonomies in the new version. As you may or may not know, MOSS 2007 (or Microsoft Office SharePoint Server) is quickly becoming the new platform of choice for many organizations. This newer version of the application is being leveraged in the development of corporate Intranets, Extranets and even public facing Internet websites, providing information workers with enhanced collaboration and document management capability.

With the exponential growth of implementations worldwide (MOSS is the fastest growing server product in the history of the company) come greater challenges and opportunities for improving knowledge management and information access within the enterprise. Consistent organizing principles across enterprise information are ever more important and, when applied correctly, can yield dramatic gains in employee productivity.

Before we get to any of the details however, let’s remind ourselves that the purpose of building and maintaining taxonomies is to improve the findability of information by:

Continue reading

The Popularity Contest: Taxonomy Development in the Petabyte Era

“Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.” (1)

Recently Chris Anderson wrote an article for Wired magazine called The End of Theory. The thesis of the article in a nutshell is that the impending petabyte era of data storage signals the end of the traditional scientific method of discovery. No longer are we bound to the outdated model of observation, hypothesis and measurement. Computers (developed by Google & IBM) “can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

Continue reading

Wordmap and Taxonomy Management

Here is a terrific article written for Content Wrangler.