SharePoint Content Structure – Let a thousand content types bloom?

“How many content types should you have?”

This is the question that came up in a conference call last week on SharePoint architecture. This organization had implemented their corporate portal on SharePoint 2007 and was interested in going forward with more portal sites but had some concerns about the approach to information architecture they had undertaken.

I answered what I would answer no matter what technology it was – “Only as many as you really need to implement the appropriate level of metadata, workflow and templates.” Which is of course vague, as most good consultant-ese is. I followed up with some stats: when we work on web content management implementations, we typically end up with about 10-15 content types for a site of medium complexity. We always try to keep the structure simple and number of content types few for many good reasons, ranging from ease of content structure management to content publisher user experience.

The folks on the phone were quiet for a minute… You see, the previous consultant they had worked with had a bit of a different (read: opposite) approach. The philosophy they described was that SharePoint content types should be created to the maximum degree of granularity (e.g. one content type per library) so as to reduce the need for content publishers to select a content type and tag metadata values. For example, if you had a site for human resources forms, you would have one library and content type for medical forms, one library and content type for dental forms, etc. Each content type would be extremely specific and require little tagging. “If you need 30,000 content types, then so be it” is the idea. (Insert eye twitch.)

The intent behind this – to reduce uncertainty and effort for content publishers – is noble and good, and in some specific cases might be the right approach. But in general, overly granular content types seem to be a case of using a sledgehammer to kill a fly. To help explain why, I thought I’d enlist the help of a couple of friends and colleagues.

First, I emailed content management guru Bob Boiko, author of the Content Management Bible, to see if he agreed. His response?

“How many content types is the right number? The fewest possible to squeeze the most value out of the info you possess. If it were my system, I would create a generic type and put all the info that I could not find a business justification for into that bucket. It’s not worth naming if you can’t say clearly why you are managing it. Then I would start with the info we have decided is most valuable and put real energy into naming the type and fleshing out the metadata behind it. Then on to the next most valuable and so on till I ran out of resources. In that way, the effort of typing is spent on the stuff that is most likely to repay the effort.”

Amen to that! But I also wanted to get a tool-specific view from my colleague and SharePoint expert Shawn Shell. So I skyped him…

Me: So, what do you think?

Shawn: Well, having a content type for every document library is certainly an interesting approach, though I think your SharePoint administrators, as well as your users, will go quite mad.

Me: I think the argument is that having this many content types is supposed to make it easier on the users by presetting all choices and removing the potential for error. If you never have to choose a content type, because each library has a very specific default that matches the content you are creating, then there’s no confusion – or so the idea goes. From a general content management perspective, this is flawed. But what about from a SharePoint-specific standpoint?

Shawn: I can understand why this might make sense on the surface. Unfortunately, I think you end up exchanging one kind of confusion for another, and there’s a huge maintenance implication as well. If you have a content type for each library, you are, for all practical purposes, requiring the user to decide where to physically store a document. That physical location then implies your classification – regardless of whether a default content type is applied.

Me: So, you’re basically recreating all the ills of a file share folder structure!

Shawn: In essence, yes. To make matters worse, more complex SharePoint environments will necessarily include multiple applications and multiple site collections. Because content types are bound to a site collection, administrators have that much more to create, maintain and keep consistent across applications and site collections. Some duplication is true of any deployment, but with such an overload of content types and libraries, the complexities of management are compounded.

Me: So, if you have 50 content types and you need to use them in three site collections, you’d have to create and maintain 150 content type definitions. A good argument to keep your use of content types judicious. Is there a hard limit to the number of content types one can manage in a site collection?

Shawn: The answer is “sort of.” There’s no specific hard limit on the number of content types in a site collection, but there are some general “soft limits” in the product around numbers of objects (generally 2,000). This particular one is an interface limit: users will see slower performance if you try to display more than 2,000 items. The condition won’t typically manifest itself for normal users, but it will for administrators. The other real limit is that the content type schema can’t exceed 2 GB. While that seems like a pretty high ceiling, if you have a content type for each library, loads of libraries in a site collection and robust content types, there’s certainly a chance of hitting it.

Me: What about search? I assume that a plethora of content types would have adverse effects on search.

Shawn: It absolutely does. Like everything we’ve discussed here, the impact is twofold: 1) administration and 2) user experience. Content types, as well as columns, can be used as facets for search. If you have an overwhelming number of facets in results, the value facets bring is reduced. Plus, as I mentioned before, having large numbers of content types can also produce performance problems when trying to enumerate all of the types included in the search results.

From an administrative standpoint, we’re back to managing all of these content types across site collections, ensuring that the columns in those content types are mapped to managed properties (a requirement for surfacing the metadata in search results) and, if you have multiple Shared Services Providers, that this work is done across all SSPs.

Me: I expect there will also be a usability issue for those trying to create content outside of the SharePoint interface. Wouldn’t users have to choose from the plethora of content types if they started in Word, for example?

Shawn: This is another excellent point. Often, when discussing solutions within SharePoint, we think only of the web interface. When developing any solution, however, you need to keep the Office and Windows Explorer interfaces in mind as well. Interestingly, using multiple document libraries, with a content type for each, makes a little more sense from the end user’s perspective, since it’s similar to physical file shares and folders. However, the same challenges many organizations face in managing file shares can manifest themselves with the multiple-library, matching-content-type approach as well – putting those organizations right back in the unmanageable place they started.

Me: Great, thanks for your insights, Shawn! I’ll be sure to spread the word to avoid a content type pandemic.

So there you have it folks. As a general rule, less is more. Standardize, simplify and don’t let your content types multiply needlessly. Your content contributors and SharePoint administrators will thank you.
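To make the trade-off concrete, here’s a toy sketch in plain Python (not the SharePoint object model – the library names and the `form_type` column are hypothetical). It contrasts the granular approach, where the library a document lives in *is* its classification, with a consolidated approach, where one generic content type carries a metadata column and classification travels with the document wherever it’s stored:

```python
from dataclasses import dataclass, field

# Granular approach: one library (and content type) per form type.
# Location implies classification, so a misfiled document is silently misclassified.
granular_libraries = {
    "MedicalForms": ["dental-claim.docx"],        # misfiled, and nothing flags it
    "DentalForms": ["dental-enrollment.docx"],
}

@dataclass
class Document:
    name: str
    metadata: dict = field(default_factory=dict)

# Consolidated approach: one generic "HR Form" content type with a
# form_type metadata column; classification is explicit, not positional.
hr_forms = [
    Document("dental-claim.docx", {"form_type": "dental"}),
    Document("dental-enrollment.docx", {"form_type": "dental"}),
    Document("medical-claim.docx", {"form_type": "medical"}),
]

def find_forms(documents, form_type):
    """Filter by metadata, regardless of where the documents are stored."""
    return [d.name for d in documents if d.metadata.get("form_type") == form_type]

print(find_forms(hr_forms, "dental"))
# prints ['dental-claim.docx', 'dental-enrollment.docx']
```

The point of the sketch: with metadata-based classification, reorganizing storage never changes what a document *is*, and one content type definition serves every library.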

Integrating Taxonomy with CMS – Book Chapter in Publication

The final draft has been submitted… Mark your calendars…

The Information Management Best Practices 2009 book is going to publication this week, in hopes of being ready for launch at the J.Boye Conference in Aarhus, Denmark, Nov 2-4. I’ll be there, giving a talk on SharePoint IA, but also to lend a hand with the book launch activities.

I’m proud to have a chapter in this book, with co-authors Seth Earley and Charlie Gray (CMS & Taxonomy Strategist, Motorola), on one of our most in-depth and successful projects – integrating taxonomy with CMS at Motorola. The best practice covers the steps below in great detail, offering practical advice and screenshots from the actual implementation.


  • Step 1: Educate Stakeholders on Taxonomy
  • Step 2: Bring a Taxonomy Expert onto your CMS Implementation Team
  • Step 3: Determine Functional Requirements

Taxonomy in Extreme Places

How often do you get to be immersed in a completely alien work environment?

As a taxonomist, I get to learn about so many different domains through my work, from mouse genetics to greeting card manufacturing. Each company has its interesting quirks and workplaces… Like the toy manufacturer, whose workers had their cubicles adorned with all sorts of inspiration and materials: multi-colored fur, googly-eye collections, pictures of themselves as superheroes…

But this week, I got to experience something completely different.

We just started a content strategy project with a semiconductor equipment manufacturer which aims to help their service groups (the folks who fix the machines) get the right information at the right time. This is an interesting project involving issues around technical writing and information architecture (DITA), integration across many different knowledge systems and databases, and getting information to users in a less than hospitable environment – the clean room.

A clean room is essentially a manufacturing or research facility that has low levels of environmental pollutants, such as dust and microbes. Pollutants are kept to a minimum through air filtering and circulation, as well as a strict dress code involving what are “lovingly” referred to as “bunny suits”. A clean room suit involves:

  • Glove liners
  • Rubber gloves x 2
  • Hair/beard net
  • Face mask
  • Shoe covers
  • Coveralls
  • Hood
  • Booties
  • Safety glasses

You get dressed in a specific sequence so as to reduce contamination… first the glove liners, rubber glove #1, hairnet, face mask, and shoe covers. Then you enter a second room where you add the hood, coveralls, booties, rubber glove #2 and safety glasses. You then walk over some sticky paper into an air lock, where you are blasted with air, and you’re now ready for the clean room.

Two minutes in a bunny suit and you gain a quick appreciation for the difficulties inherent to working in such an environment. It’s hot under all those layers, you have poor peripheral vision in the hood, the glasses constantly fog up from your breath under the mask, and it’s hard to walk. (Well, I have to admit that the “hard to walk” part is probably because I was wearing high heels inside my booties – ill-advised and embarrassing! I also made the newbie mistake of taking a cough drop before putting on my mask, and I ended up breathing menthol air into my eyes and fighting back tears the whole time.)

But if I’ve set the scene up appropriately, you can start to imagine the challenges inherent to knowledge work in this environment. First of all, it’s hard to get access to information – carrying around a laptop is difficult, your hands are slippery, there’s nowhere to set it down in the lab, nowhere to plug it in… Even if you did find a place for it, you can’t use a track pad when you’re wearing three layers of gloves – it’s hard to type, and the gloves don’t create enough friction for the pad to register movement. You might use a tablet and stylus, but there are holes in the floor, so if you drop it… You might use a handheld device, but again, with gloved hands, good luck typing on that tiny keypad, and the screen is much too small to show detailed tool schematics. You don’t have access to the internet, so all the information has to be available on the machine, and there are hundreds of parts for each machine.

Add the next layer: search, systems and content structure. These folks currently have to search across multiple systems to try to find documentation on specific problems… starting with the original manual, which is likely for the product as it was shipped, not as it was configured at the client site. There are multiple databases where there might be troubleshooting tips or solutions, but you have to check them individually. The content is not well tagged or structured, so if you do find a document that might be useful, it’s typically a gigantic PDF that you have to comb through.

As you can see, this is a challenging problem: how do you get the right information (and the right amount of it), well structured and accessible, to people in the clean room environment? What part of the solution involves structured writing in XML vs. system integration vs. taxonomy and metadata, and how do we pull all those pieces together to offer a simple interface to a service professional?

We’ll be working on this project for the coming weeks, so I’ll keep you posted on the conclusions and insights. In the meantime, I’m sure this will be my personal “one to top” in terms of taxonomizing in extreme places. Perhaps I’ll beat it if we ever do any work with spelunkers, submarines, or NASA…

Share your extreme taxonomy stories in the comments!

Collaboration, Groove and SharePoint – History Repeating Itself?

I just read that Groove is being renamed SharePoint Workspace 2010. For those of you who are not familiar with Groove or its history, I’ll take you back to the early ’80s.

Ray Ozzie is the visionary behind Groove and currently the Chief Software Architect at Microsoft (a role he took over from Bill Gates). At the University of Illinois (as many know, home to the NCSA, which created Mosaic, the graphical web browser on which Internet Explorer is based), Ozzie worked on early iterations of some of today’s knowledge management, collaboration and social media applications: discussion forums, message boards, e-learning, e-mail, chat rooms, instant messaging, remote screen sharing, and multi-player games.

He also worked with some of the pioneers in personal computing and products like Visicalc, one of the first spreadsheet programs that ushered in the age of personal productivity.

Ozzie worked for a time at Lotus Development, then went on to form a new venture called Iris Associates, which developed a collaboration tool called Notes. Lotus acquired the rights to Notes, with Iris remaining a separate entity but doing all of the research and development behind the product.


OASIS Approves UIMA – the first standard for accessing Unstructured Information

Early last month, OASIS announced the approval of the Unstructured Information Management Architecture (UIMA) Version 1.0.  This standard creates an open method for accessing unstructured information – that is, any information created by and for people that is not inherently machine-readable (i.e., not structured data).  UIMA could become very important, since it provides a standard mechanism for exchanging metadata about all types of unstructured content – documents, web pages, email, voice, images and video.

As we all have heard repeated in the marketing messages of every content-related software company, over 80% of the data we run our businesses on is unstructured.  In our business we help our clients tame their mountains of content by classifying it.  Often we rely on technologies like auto-classification, entity extraction, and other analytics to tag content with metadata.  Metadata helps us bring structure – and in turn semantics or meaning – to unstructured content. 

Of course, each of these systems has its own API and its own methods of expressing the metadata it produces or consumes.  This is where UIMA comes in.  In the introduction to the UIMA standard, the team at OASIS describes a typical workflow in which various analytics packages may need to interact:

An example of assigning semantics includes labeling regions of text in a text document with appropriate XML tags that identify the names of organizations or products. Another example may extract elements of a document and insert them in the appropriate fields of a relational database, or use them to create instances of concepts in a knowledgebase. Another example may analyze a voice stream and tag it with the information explicitly identifying the speaker, or identifying a person or a type of physical object in a series of video frames.

Analytics are typically reused and combined together in different flows to perform application-specific aggregate analyses. For example, in the analysis of a document the first analytic may simply identify and label the distinct tokens or words in the document. The next analytic might identify parts of speech, the third might use the output of the previous two to more accurately identify instances of persons, organizations and the relationships between them.
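The kind of reuse described in that workflow – one analytic consuming another’s annotations over a shared data structure – can be sketched generically. This is illustrative Python only, not the actual UIMA API or type system; the `Cas` and `Annotation` classes are simplified stand-ins, and the "person detector" is a deliberately naive heuristic:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    begin: int      # character offset where the annotation starts
    end: int        # character offset where it ends
    type: str       # e.g. "Token", "PersonName"
    features: dict = field(default_factory=dict)

@dataclass
class Cas:
    """A minimal stand-in for a common analysis structure:
    the artifact (text) plus the metadata produced about it."""
    text: str
    annotations: list = field(default_factory=list)

def tokenizer(cas):
    """First analytic: label each whitespace-delimited token."""
    offset = 0
    for word in cas.text.split():
        begin = cas.text.index(word, offset)
        cas.annotations.append(Annotation(begin, begin + len(word), "Token"))
        offset = begin + len(word)
    return cas

def person_detector(cas):
    """Downstream analytic: reuses the tokenizer's annotations instead of
    re-parsing the raw text (toy heuristic: any capitalized token)."""
    for ann in [a for a in cas.annotations if a.type == "Token"]:
        token = cas.text[ann.begin:ann.end]
        if token[0].isupper():
            cas.annotations.append(
                Annotation(ann.begin, ann.end, "PersonName", {"text": token}))
    return cas

# Analytics compose into a flow because they share one data representation.
cas = person_detector(tokenizer(Cas("contact Alice in engineering")))
people = [a.features["text"] for a in cas.annotations if a.type == "PersonName"]
print(people)  # prints ['Alice']
```

The interoperability point is that, with a standard like UIMA, the tokenizer and the detector could come from different vendors and still chain together, because both read and write the same annotation structure.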

UIMA is unique in that it enables the interoperability of analytics across platforms, frameworks, and content types (text, audio, video, etc.).  If various vendors opt to support UIMA, it would enable “best-of-breed” analytics workflows to be created by linking multiple vendors’ components into solutions.

The areas covered by UIMA are described in the standard as follows:

The UIMA specification defines platform-independent data representations and interfaces for text and multi-modal analytics. The principal objective of the UIMA specification is to support interoperability among analytics.  This objective is subdivided into the following four design goals:

  1. Data Representation. Support the common representation of artifacts and artifact metadata independently of artifact modality and domain model and in a way that is independent of the original representation of the artifact.
  2. Data Modeling and Interchange. Support the platform-independent interchange of analysis data (artifact and its metadata) in a form that facilitates a formal modeling approach and alignment with existing programming systems and standards.
  3. Discovery, Reuse and Composition. Support the discovery, reuse and composition of independently-developed analytics.
  4. Service-Level Interoperability. Support concrete interoperability of independently developed analytics based on a common service description and associated SOAP bindings.

Here at Earley & Associates we will keep our eyes and ears tuned to understand which vendors will incorporate UIMA into their product offerings.  IBM and EMC are on the UIMA working group, and much of the project was based on an open source effort at IBM (UIMA began as an IBM alphaWorks project).  IBM already touts support for UIMA in several of its information management products, including eDiscovery Analyzer, Content Analyzer, and OmniFind.  There is an incubator project for UIMA-based open source software hosted at the Apache Software Foundation.  As part of the eligibility process for becoming an OASIS standard, successful use of UIMA was verified by IBM, Amsoft, Carnegie Mellon University, Thomson Reuters, and the University of Tokyo.

We look forward to learning of other vendors who adopt this standard – it will surely make our jobs easier, and those of everyone else who must integrate systems that produce and consume metadata.