Exponential Growth of Unstructured Data and Its Risk to Life Sciences R&D Organizations

Managing the volume, variety and velocity of data is one of the best-known, and least understood, challenges in the Life Sciences industry. According to IDC, data volume grows by 50% year over year. This growth is a major obstacle across the industry, particularly in Research & Development.

At the highest level, data comes in three forms:

Structured Data, which refers to any data that resides in a fixed field within a record or file. It has the advantage of being easily entered, stored, queried and analyzed (e.g., the data in your ERP system).

Unstructured Data, which covers everything that can’t be readily classified and organized, such as photos, graphics, videos, webpages, PDF files, PowerPoint presentations, emails, spreadsheets, and word processing documents.

Semi-Structured Data, which, as the name suggests, is a cross between structured and unstructured data. It is a type of structured data but lacks a strict data model (e.g., patient package inserts).

Structured data is relatively easy for computers to understand and manage, but semi-structured and unstructured data (more than 70% of the data in the enterprise) presents a bigger problem. Semi- and unstructured data consists largely of text and contains information that is critical to the enterprise. Since computers cannot readily interpret these information assets, it typically falls to a human to read and interpret the content and extract the relevant facts. Semi- and unstructured data represents one of the biggest risks to the Life Sciences industry, particularly in the area of Health Authority compliance.

Semi- and Unstructured Data: Concerns with the Current Approach

The first problem with semi- and unstructured data is “simply” identifying and finding it (in quotes because the solution is anything but simple). For example, if you’re searching for legacy Health Authority correspondence or product registration information held in a Word document or PDF, there is no reliable way to tell which of the millions of documents contains it.

Specifically, there’s no efficient way to (a) know which document or file holds the information, let alone (b) find the specific data within it.

As a consequence, an inordinate amount of time is spent identifying and classifying data, for everything from basic internal searches to very challenging efforts such as responding to a Health Authority request for information. The problem is exacerbated by routine compliance tasks such as Adverse Event Reporting, where you need to extract the relevant facts from free-form submissions from healthcare providers, or IDMP, where up to 50% of the information needed for submission is buried in your documents.

Best Practices for Addressing Unstructured and Semi-structured Data

Start with a defined problem area and expand to additional problem areas after you have solved the first problem.  Don’t try to boil the ocean.

Involve subject matter experts. SMEs are experts not only because they’ve implemented these projects before but, more importantly, because they stay up to date on industry standards and shifting vocabulary. They are better equipped to assist you in developing, managing, and maintaining both internal and external vocabularies.

Don’t start with a blank sheet of paper.  Take advantage of publicly available vocabularies which can be enriched with your vernacular.

Take an iterative approach to vocabulary development and fact extraction. Fine-tune your vocabulary and classification strategy until you get the desired results.

Remember that the vocabulary is a living thing and will need to be curated as your business evolves.

Don’t assume automation will completely displace the involvement of business experts. There is always a need for human decision making. Automation only takes over the routine tasks (and does so with fewer errors and at lower cost).
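
To make the vocabulary-driven practices above more concrete, here is a minimal sketch in Python of tagging a free-text document against a small controlled vocabulary. It is an illustration only, not how any particular product works: the vocabulary entries, the function names (build_patterns, tag_document) and the sample text are all hypothetical, and a real vocabulary would start from a publicly available one and be curated by SMEs.

```python
import re

# Hypothetical vocabulary: preferred concept -> labels and synonyms.
# Illustrative entries only; a real vocabulary would be far larger.
VOCABULARY = {
    "adverse event": ["adverse event", "adverse reaction", "side effect"],
    "health authority": ["health authority", "FDA", "EMA"],
}


def build_patterns(vocabulary):
    """Compile one case-insensitive, whole-word regex per concept."""
    patterns = {}
    for concept, labels in vocabulary.items():
        # Put longer labels first so they are preferred over shorter ones.
        ordered = sorted(labels, key=len, reverse=True)
        alternation = "|".join(re.escape(label) for label in ordered)
        patterns[concept] = re.compile(
            r"\b(?:" + alternation + r")\b", re.IGNORECASE
        )
    return patterns


def tag_document(text, patterns):
    """Return {concept: number of mentions} for concepts found in text."""
    tags = {}
    for concept, pattern in patterns.items():
        hits = pattern.findall(text)
        if hits:
            tags[concept] = len(hits)
    return tags


sample = "The FDA requested details on each adverse reaction and side effect."
print(tag_document(sample, build_patterns(VOCABULARY)))
# {'adverse event': 2, 'health authority': 1}

# Iterating on the vocabulary: an SME adds a term used in-house,
# and the same document now yields an extra mention.
VOCABULARY["health authority"].append("regulator")
followup = sample + " The regulator expects a reply within 30 days."
print(tag_document(followup, build_patterns(VOCABULARY)))
# {'adverse event': 2, 'health authority': 2}
```

Simple pattern matching like this only covers the routine part of the job. Ambiguous terms, context and new vocabulary still need review by business experts, which is why the vocabulary has to be treated as a living thing.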


If you want to know more about how technology is assisting industry experts in organizing unstructured data, check out this white paper on Content Intelligence written by the SMEs at Smartlogic.




Smartlogic, the Content Intelligence Company, and Highpoint Solutions have teamed up to solve this pressing problem. The combination of Smartlogic’s sophisticated semantic technologies platform and Highpoint Solutions’ deep knowledge of the Life Sciences industry is helping Life Sciences organizations everywhere to find, manage and leverage the value in their semi-structured and unstructured information assets.


About Smartlogic

Smartlogic’s Semaphore is an enterprise-grade Content Intelligence platform that complements an organization’s investment in enterprise search, workflow, analytics and content management systems. Semaphore is the bridge between human language and computers. It extracts the human intelligence from information assets and applies precise and consistent metadata to harmonize information and drive business decisions.