OASIS Approves UIMA – the first standard for accessing Unstructured Information

Early last month, OASIS announced the approval of the Unstructured Information Management Architecture (UIMA) Version 1.0.  This standard creates an open method for accessing unstructured information – that is, any information that is created by and for people and is not inherently machine-readable (i.e., not structured data).  UIMA could become very important because it provides a standard mechanism for exchanging metadata about all types of unstructured content: documents, web pages, email, voice, images and video.

As we all have heard repeated in the marketing messages of every content-related software company, over 80% of the data we run our businesses on is unstructured.  In our business we help our clients tame their mountains of content by classifying it.  Often we rely on technologies like auto-classification, entity extraction, and other analytics to tag content with metadata.  Metadata helps us bring structure – and in turn semantics or meaning – to unstructured content. 

Of course, each of these systems has its own API and its own methods of expressing the metadata it produces or consumes.  This is where UIMA comes in.  In the introduction to the UIMA standard, the team at OASIS describes a typical workflow in which various analytics packages may need to interact:

An example of assigning semantics includes labeling regions of text in a text document with appropriate XML tags that identify the names of organizations or products. Another example may extract elements of a document and insert them in the appropriate fields of a relational database, or use them to create instances of concepts in a knowledgebase. Another example may analyze a voice stream and tag it with the information explicitly identifying the speaker, or identifying a person or a type of physical object in a series of video frames.

Analytics are typically reused and combined together in different flows to perform application-specific aggregate analyses. For example, in the analysis of a document the first analytic may simply identify and label the distinct tokens or words in the document. The next analytic might identify parts of speech, the third might use the output of the previous two to more accurately identify instances of persons, organizations and the relationships between them.

UIMA is unique in that it enables the interoperability of analytics across platforms, frameworks, and content types (text, audio, video, etc.).  If vendors opt to support UIMA, “best-of-breed” analytics workflows could be built by linking components from multiple vendors into a single solution.
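To make the idea of chained analytics concrete, here is a minimal sketch in Python of the kind of flow the introduction describes: a tokenizer, a toy tagger, and a name finder that builds on the output of the first two.  It is an illustration only, not the UIMA API or the data model defined in the standard; the names Annotation, Artifact, tokenizer, capitalization_tagger, organization_finder and run_pipeline are all hypothetical, and the "tagger" is a crude capitalization check standing in for real part-of-speech analysis.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Stand-off metadata: a typed label over a character span of the text."""
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Artifact:
    """The content being analyzed plus every annotation added so far."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def tokenizer(artifact):
    """First analytic: label each whitespace-separated token."""
    for match in re.finditer(r"\S+", artifact.text):
        artifact.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalization_tagger(artifact):
    """Second analytic: toy stand-in for a part-of-speech tagger."""
    for token in artifact.select("Token"):
        word = artifact.covered_text(token)
        token.features["pos"] = "PROPN" if word[0].isupper() else "OTHER"


def organization_finder(artifact):
    """Third analytic: merge runs of two or more PROPN tokens into one span."""
    run = []
    for token in artifact.select("Token") + [None]:  # None flushes the last run
        if token is not None and token.features.get("pos") == "PROPN":
            run.append(token)
        else:
            if len(run) >= 2:
                artifact.annotations.append(
                    Annotation("Organization", run[0].begin, run[-1].end)
                )
            run = []


def run_pipeline(text, analytics):
    """Pass one shared artifact through each analytic in order."""
    artifact = Artifact(text)
    for analytic in analytics:
        analytic(artifact)
    return artifact


artifact = run_pipeline(
    "Thomson Reuters verified the standard",
    [tokenizer, capitalization_tagger, organization_finder],
)
orgs = [artifact.covered_text(a) for a in artifact.select("Organization")]
print(orgs)
```

The point of the sketch is that the three analytics never call each other; they only read and write annotations on a shared structure.  That shared structure is the piece a standard would need to pin down for components from different vendors to be swapped in and out.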

The areas covered by UIMA are described in the standard as follows:

The UIMA specification defines platform-independent data representations and interfaces for text and multi-modal analytics. The principal objective of the UIMA specification is to support interoperability among analytics.  This objective is subdivided into the following four design goals:

  1. Data Representation. Support the common representation of artifacts and artifact metadata independently of artifact modality and domain model and in a way that is independent of the original representation of the artifact.
  2. Data Modeling and Interchange. Support the platform-independent interchange of analysis data (artifact and its metadata) in a form that facilitates a formal modeling approach and alignment with existing programming systems and standards.
  3. Discovery, Reuse and Composition. Support the discovery, reuse and composition of independently-developed analytics.
  4. Service-Level Interoperability. Support concrete interoperability of independently developed analytics based on a common service description and associated SOAP bindings.

Here at Earley & Associates we will be watching to see which vendors incorporate UIMA into their product offerings.  IBM and EMC are on the UIMA working group, and much of the standard is based on an open source project that originated at IBM (IBM’s UIMA was an alphaWorks project).  IBM already touts support for UIMA in several of its information management products, including eDiscovery Analyzer, Content Analyzer, and OmniFind.  There is also an incubator project for UIMA-based open source software hosted at the Apache Software Foundation.  As part of the eligibility process for becoming an OASIS standard, successful use of UIMA was verified by IBM, Amsoft, Carnegie Mellon University, Thomson Reuters, and the University of Tokyo.

We look forward to learning of other vendors who will adopt this standard – it will surely make our jobs easier, and those of others who must integrate systems that produce and consume metadata.

Search Solutions Jumpstart Series

From time to time we organize a free educational conference call series on search, taxonomy or content management. Next month, we’ll be running our Search Series.

Register at http://www.earley.com/Searchjumpstart2008.asp
