
Five Technical Challenges of EIM


By Dave Reiner

Holistic information for business processes

Enterprise Information Management (EIM) is a strategic combination of components and services that weave together and deliver holistic information—consistent, timely, and meaningful—to business processes. From financial services, pharmaceuticals, and manufacturing to retailing, telecommunications, intelligence-gathering, and the energy sector, dozens of industries regard EIM as central to business competitiveness and differentiation.

A typical analytic use case is to manage enterprise performance, with information sources including competitive web sites, analyst reports, and SEC filings in addition to the more traditional databases of revenues and costs by organization, region, and product line. On the operational side, a typical use case is to identify and classify complex enterprise events to accelerate enterprise response to business process delays, supply chain failures, or rapidly evolving external perceptions of the enterprise.

EIM use cases generally require combining structured and unstructured information from a broad set of sources. Many of the use cases have a dynamic, near-real-time flavor. Most necessitate an understanding of information meaning and interrelationships to help create holistic views in context. And all are looking to improve the experience for the end user for whom ease of accessing, navigating, and distilling information translate into business advantage.

In short, EIM use cases present five complex technical challenges:

1. Dealing with varying degrees of structure in information sources
2. Dynamically locating information and accessing it securely
3. Understanding the meaning of information
4. Integrating federated, heterogeneous information
5. Facilitating user navigation, analysis, and visualization of information

The five sections below give a more comprehensive view of these challenges. In a follow-up to this article, I’ll address the new stack of EIM components and services for near real-time information access, analysis, integration and governance, and propose some practical starting points.

1.  Dealing with varying degrees of structure in information sources

From planning documents, public web pages, and sales figures stored in corporate databases to customer e-mails and RSS feeds, business information sources have varying degrees of structure. The most structured sources, where tabular information is regimented into rows with identical typed columns and keys, include relational and legacy databases, multidimensional databases, flat files in record format, and spreadsheet tables. The least structured sources, with little external structure defining information content, format, or meaning, include most documents, images and graphics, audio and video files, web pages, wikis, blogs, instant messages, source code, and reports.  In the middle is a category of semi-structured sources, such as e-mails, XML data, RDF knowledge graphs, EDI documents, and RSS feeds.

Structured sources such as a relational database have explicit metadata or schemas that describe their information in detail.  Semi-structured sources have partial metadata, such as an email’s author and subject. A semi-structured XML document may have an associated XML Schema Definition (XSD) file constraining its elements and attributes. Unstructured sources (and the unstructured portions of semi-structured sources, such as the body of an e-mail) do not have explicit metadata, so they must be indexed, tagged, or otherwise analyzed before they can be filtered and combined with other sources. There’s tremendous value, for example, in taking structured data about customers and linking it to relevant unstructured and semi-structured information such as e-mails, scanned hardcopy correspondence, and notes from conversations. It provides rich context that can enable businesses to deliver much more responsive customer service. A database that lists terrorist sightings becomes simultaneously more powerful and more nuanced when it’s combined with photos, video, audio, and anecdotal text.
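
As a deliberately simplified illustration, the Python sketch below links a structured customer record to a semi-structured e-mail and derives crude metadata from the unstructured body. All field names, values, and tag keywords here are hypothetical.

```python
# Illustrative only: three degrees of structure around one customer.
# All field names, values, and tag keywords are hypothetical.

customer_row = {            # structured: typed columns with a key
    "customer_id": 1042,
    "name": "Acme Corp",
    "region": "Northeast",
    "spend_to_date": 187500.00,
}

email = {                   # semi-structured: some explicit metadata...
    "from": "ops@acmecorp.example",
    "subject": "Delayed shipment on order 7781",
    "customer_id": 1042,
    "body": "We are still waiting on last month's order. "
            "Please escalate; this is blocking our production line.",
}                           # ...but the body itself is unstructured text

def derive_metadata(text: str) -> dict:
    """Very crude content analysis: tag the unstructured body."""
    tags = [kw for kw in ("delayed", "escalate", "refund") if kw in text.lower()]
    return {"tags": tags, "length": len(text)}

# A holistic view joins the structured facts to the derived metadata.
holistic_view = {**customer_row,
                 "last_contact_subject": email["subject"],
                 "last_contact_metadata": derive_metadata(email["body"])}
print(holistic_view)
```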

2.  Dynamically locating information and accessing it securely

A straightforward marketing query for customer spend to date over structured sources of customer information may need access to multiple databases to compute an aggregated total. Some preprocessing may be done ahead of time (data warehouses and marts come to mind), but dynamically computed, near-real-time results are desirable. Although the results must often be assembled from a combination of SQL queries, API calls, and information service requests against enterprise systems, the breadth of sources increases the accuracy and completeness of the answer.
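
A minimal sketch of such an on-demand aggregation follows, assuming one relational source reachable by SQL and one hypothetical billing service reachable over HTTP; the database file, table name, and URL are illustrative assumptions, not references to any real system.

```python
# Sketch of assembling "customer spend to date" from two sources on demand.
# The database file, table, and service URL are illustrative assumptions.
import json
import sqlite3
import urllib.request

def spend_from_warehouse(customer_id: int) -> float:
    """SQL path: aggregate from a relational source."""
    conn = sqlite3.connect("orders.db")                     # hypothetical database
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    conn.close()
    return float(row[0])

def spend_from_billing_service(customer_id: int) -> float:
    """Service path: call a hypothetical REST information service."""
    url = f"https://billing.example.com/customers/{customer_id}/spend"
    with urllib.request.urlopen(url) as resp:
        return float(json.load(resp)["total"])

def customer_spend_to_date(customer_id: int) -> float:
    # The near-real-time answer is assembled across both sources.
    return spend_from_warehouse(customer_id) + spend_from_billing_service(customer_id)
```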

The search box is fast becoming ubiquitous in user applications.  Executing a search query against unstructured information sources, such as research publications on pharmaceutical trial outcomes, depends on efficient and comprehensive search indexes, on the quality of content tagging, and on access rights to the underlying information sources. These factors make search results more valuable to users.
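
The sketch below shows the combination of a search index and an access-rights check in miniature, using a toy inverted index; the documents and the permission model are assumptions for illustration.

```python
# Toy inverted index with an access-rights check at query time.
# Documents and the permission model are assumptions for illustration.
from collections import defaultdict

docs = {
    "trial_042.txt": "Phase II trial outcomes show improved response rates.",
    "trial_043.txt": "Adverse events were within expected ranges.",
}
acl = {"trial_042.txt": {"alice", "bob"}, "trial_043.txt": {"alice"}}

index = defaultdict(set)                       # term -> documents containing it
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term.strip(".,")].add(doc_id)

def search(term: str, user: str) -> list:
    """Return only the matching documents the user is entitled to see."""
    hits = index.get(term.lower(), set())
    return sorted(d for d in hits if user in acl.get(d, set()))

print(search("trial", "bob"))                  # ['trial_042.txt']
```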

In a typical corporate litigation, responding to a broad-scope eDiscovery request means sifting through dozens of loosely connected systems, ranging from formal databases, content repositories, and archives to individual PC file systems. Locating the right information greatly reduces corporate risk exposure.

Concerns about security, privacy, and compliance complicate the picture. Credentials for secure access vary across diverse enterprise systems, and security breaches can bring enormous liability. Authentication, authorization, and accounting need to be strictly enforced at lower levels of access to databases, repositories, file systems, and information services, and whenever information lifecycle events such as archiving occur. Privacy violations may arise from secondary use or non-obfuscation of personal identifying information (PII), and from combining information across disparate sources.
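
One common mitigation is to pseudonymize PII before records from different sources are combined. The sketch below illustrates the idea with a salted one-way hash; the field names and the salt handling are simplifying assumptions, not a recommended production design.

```python
# Sketch: pseudonymize personally identifying fields before records from
# different sources are combined. Field names and salt handling are
# simplifying assumptions, not a recommended production design.
import hashlib

PII_FIELDS = {"name", "ssn", "email"}

def pseudonymize(record: dict, salt: str = "rotate-this-salt") -> dict:
    """Replace PII values with stable one-way tokens so records can still be
    joined across sources without exposing the raw identifiers."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[key] = digest[:12]
        else:
            masked[key] = value
    return masked

print(pseudonymize({"name": "Jane Doe", "ssn": "123-45-6789", "region": "NE"}))
```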

With respect to compliance, most personal and sensitive information falls under regulatory protection. The Gramm-Leach-Bliley Act requires financial institutions to prevent unauthorized access to non-public personal information, including who their customers are. The Health Insurance Portability and Accountability Act’s (HIPAA) Privacy Rule establishes regulations for the use and disclosure of protected health information (PHI).

3.  Understanding the meaning of information

Across structured systems, there are subtle differences in definitions of customers, product categories, cost structures, revenue recognition, claim status, event classification, and so on. The very absence of a data value may be interpreted as unknown, not applicable, zero, or as a default value. The potential for misinterpretation by diverse applications is huge. Metadata helps query and system builders understand the source, derivation, and intended use of fields, but it may be unavailable or hidden below a business view of information. Information transformation and mapping approaches can clean and normalize data to create integrated views, but sometimes at the cost of burying semantic knowledge in transformations.
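
The sketch below shows how a single missing value shifts a simple metric depending on whether it is read as zero or as unknown; the revenue figures are made up for illustration.

```python
# How one missing value shifts a simple metric depending on interpretation.
# The revenue figures are made up for illustration.
revenues = [1200.0, None, 800.0]      # None: unknown, not applicable, or zero?

avg_none_as_zero = sum(r or 0.0 for r in revenues) / len(revenues)
avg_known_only = (sum(r for r in revenues if r is not None)
                  / sum(1 for r in revenues if r is not None))

print(round(avg_none_as_zero, 2), avg_known_only)   # 666.67 vs. 1000.0
```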

Where unstructured or semi-structured information is involved, many aspects of meaning need to be derived or extracted in a process often called content analysis. Metadata derived from content such as documents, e-mails, or web pages includes:

  • Attributes (such as “document author”)
  • Tags (such as “political” or “software”)
  • Classifications (such as “mission-critical”)
  • Extracted entities (such as “Kenneth Lay” and “Enron”)
  • Extracted relationships (such as “Dave Reiner works at EMC”)
  • Sentiments (such as “negative” in a customer complaint e-mail)
  • Clusters (such as related documents)
  • Indexes and facets (for search and navigation)
  • Mappings (to related information)
  • Information usage patterns (for optimization and governance)

Such metadata adds considerable value to information and enables it to be indexed, catalogued, searched, retrieved, and reused. Automated content analysis and metadata extraction reduce the burden on human users to classify or tag information.
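
The sketch below derives a few of these metadata types from a short e-mail body using crude keyword and pattern matching; the entity pattern, tag list, and sentiment rule are toy assumptions standing in for real content-analysis pipelines.

```python
# Crude content analysis: derive a few metadata types from an e-mail body.
# The entity pattern, tag list, and sentiment rule are toy assumptions
# standing in for real extraction and classification pipelines.
import re

email_body = ("Kenneth Lay of Enron never responded to my complaint. "
              "I am extremely disappointed with the service.")

metadata = {
    "extracted_entities": re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", email_body),
    "tags": [t for t in ("complaint", "service", "refund") if t in email_body.lower()],
    "sentiment": ("negative"
                  if any(w in email_body.lower() for w in ("disappointed", "angry"))
                  else "neutral"),
    "classification": "customer-service" if "service" in email_body.lower() else "general",
}
print(metadata)
```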

Several key concepts from the semantic web1,2 help to express the meaning of content to software agents and people. An ontology is a precisely defined common terminology for a specific domain of knowledge, such as vehicles. It describes entities (e.g., people, vehicles), attributes (model, year, type, mileage), interrelationships (people own vehicles, people drive vehicles), and rules (Massachusetts motorcycle drivers must be at least 16 years old).

An ontology can be used as a basis for classification, inference, search, mapping, and reasoning about entities in a domain. For example, “physicians” and “doctors” may be regarded as equivalent in a medical ontology. Web Ontology Language (OWL)3 enables knowledge-level interoperation of intelligent agents and allows programs to access information structure and content.  OWL extends the Resource Description Framework (RDF),4 which supports a more limited repertoire of resource and property description. To describe web services, the Web Services Description Language (WSDL)5 is foundational. But WSDL is a purely syntactic approach; its recently proposed extension, SAWSDL,6 will allow web services to be tied to semantics critical to understanding domain-specific information.
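
To make the ontology idea concrete without committing to any particular OWL or RDF toolkit, the sketch below models term equivalence and one simple rule in plain Python; the terminology mapping and the age-rule defaults are illustrative assumptions.

```python
# Plain-Python stand-in for ontology concepts: term equivalence plus a rule.
# The terminology mapping and the rule table are illustrative assumptions;
# real systems would express these in OWL/RDF and use a reasoner.
EQUIVALENT_TERMS = {"physicians": "doctor", "physician": "doctor", "doctors": "doctor"}

def canonical(term: str) -> str:
    """Map a surface term to its canonical concept."""
    t = term.lower()
    return EQUIVALENT_TERMS.get(t, t)

assert canonical("Physicians") == canonical("doctor")      # treated as equivalent

# A rule from the vehicle ontology example: minimum motorcycle driver age.
def may_drive_motorcycle(age: int, state: str = "MA") -> bool:
    minimum_age = {"MA": 16}                                # assumed rule table
    return age >= minimum_age.get(state, 18)                # assumed default

print(may_drive_motorcycle(17))                             # True
```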

4.  Integrating federated, heterogeneous information

Autonomous information sources may have different data models, schemas, naming conventions, attribute domains, data quality, value precision, information currency, encodings, and constraints. Federated information refers to information from systems and sources where overall central authority is weak, although partial sharing and coordination are possible.  The goal in integrating federated, heterogeneous information is to use tools and processes that deal with ambiguities, simultaneously missing as little as possible and keeping incorrect matches to a minimum.

In addition to point-to-point integration, many historical solutions to the integration challenge have their roots in the extract-transform-load (ETL) tools used to build data warehouses. Data is extracted from its sources through queries, interfaces, and adapters.  It is transformed, cleansed and reconciled according to relatively arcane business rules, and then loaded as quickly as possible into a data warehouse. In turn, the data warehouse supplies extracts known as data marts that serve functional or geographic areas of the business. This is usually done overnight through a batch process, whose results reflect a particular point in time such as the end of the business day.
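
A minimal sketch of this batch ETL pattern is shown below, assuming a hypothetical sales table and a single cleansing rule; real ETL jobs involve far richer transformation and reconciliation logic.

```python
# Minimal batch ETL sketch: extract rows, apply one cleansing rule, load
# into a warehouse table. The table names, columns, and rule are assumptions.
import sqlite3

def run_nightly_etl(source_db: str, warehouse_db: str) -> None:
    src = sqlite3.connect(source_db)
    wh = sqlite3.connect(warehouse_db)
    wh.execute("""CREATE TABLE IF NOT EXISTS fact_sales
                  (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)""")

    # Extract
    rows = src.execute("SELECT sale_id, region, amount FROM sales").fetchall()

    # Transform: normalize region codes, drop rows with missing or negative amounts
    cleaned = [(sale_id, (region or "UNKNOWN").strip().upper(), amount)
               for sale_id, region, amount in rows
               if amount is not None and amount >= 0]

    # Load
    wh.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?)", cleaned)
    wh.commit()
    src.close()
    wh.close()
```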

However, overnight consolidation doesn’t meet the needs of processes such as supply chain management, where a near-real-time view of federated information is required. Rather than periodically moving and combining data, the challenge is to leave information where it resides and retrieve it on demand. This requires coordination and optimization of queries against federated information sources, which may have different query languages and service interfaces. There are tradeoffs between the extremes of materializing all information and fetching just what is required on demand.

An underlying technical challenge involves matching entities across information sources. Information about customers, products, employees and partners is scattered and often inconsistent. But business processes such as revenue recognition, customer service, and financial rollup need to see a unique reference model, or “single version of the truth.” This challenge, which blends data modeling, fuzzy matching, and synchronization aspects with cross-organizational politics, is usually called master data management.

For health care, the entities to be matched include patient, provider, and location; for insurance, they include consumer, provider, incident, and claim.  A more general, related challenge is change data capture, where data in a database or elsewhere has changed and may require change propagation or data synchronization.
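
The sketch below illustrates the fuzzy-matching core of entity resolution, assuming two hypothetical customer records and a similarity threshold; production master data management layers survivorship rules, stewardship workflows, and synchronization on top of this.

```python
# Fuzzy matching at the core of entity resolution: do two records describe
# the same customer? The fields compared and the threshold are assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_customer(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    name_score = similarity(rec_a["name"], rec_b["name"])
    city_score = similarity(rec_a.get("city", ""), rec_b.get("city", ""))
    return (name_score + city_score) / 2 >= threshold

a = {"name": "Acme Corporation", "city": "Boston"}
b = {"name": "ACME Corp.", "city": "Boston"}
print(likely_same_customer(a, b))      # True under this toy threshold
```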

The challenges of unstructured or semi-structured content integration are mainly at the metadata level. The underlying information is integrated (or at least captured or referenced) based on indexes, associated tags, extracted entities and relationships, and document clusters.  To assess relevancy, this metadata needs to be compared to search targets and constraints, and expanded to factor in the initial search context and to apply any domain-specific ontologies that assist the automated interpretation of meaning.

5.  Facilitating user navigation, analysis, and visualization of information

Delivering holistic information to a user or business process is not the end of the line. Users examine, rearrange, narrow, and analyze their information incrementally. It’s challenging to anticipate user information needs, to reformat and repackage results, to support navigation of virtual information views, and to simplify user analysis and visualization of information.

Anticipating user information needs starts with database design, which requires understanding the types and scope of user queries, and deciding how to collect, structure, and group information. More generally, EIM entails predicting important information sources and conveying their contents to users. Such planning does not preclude later searching and serendipitous discovery of valuable information.

Navigating information means bridging to related information or drilling up or down with respect to the level of detail. For navigation, a useful generalization is the idea of dimensions of information. Traditionally, structured enterprise information has dimensions such as time, business unit, geographic region, product family, product price, customer segment, and sales channel. Unstructured content also has dimensions, or facets, which depend on content attributes, tags, classifications, and associated metadata. For example, the facets on a consumer product company’s shopping website might include manufacturer, price range, color, size, and average consumer rating. Constraints on dimensions represent a narrowing of context, while navigating dimensions is a natural way for users to link to related information. Dynamic tags generated by social interaction (e.g., through del.icio.us) can be valuable as additional dimensions. This is particularly important for rich digital media.  Information users may also need automated assistance to make sense of their data sets by identifying, interrelating, clustering, and grouping.
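
A small sketch of faceted narrowing follows, assuming a toy product catalog; the facet names and values are illustrative.

```python
# Faceted narrowing of a toy product catalog; facets and values are illustrative.
from collections import Counter

catalog = [
    {"name": "Trail Jacket", "manufacturer": "Northwind", "color": "red",  "price": 129.0},
    {"name": "City Jacket",  "manufacturer": "Northwind", "color": "blue", "price": 89.0},
    {"name": "Rain Shell",   "manufacturer": "Stratus",   "color": "red",  "price": 59.0},
]

def narrow(items, **facet_constraints):
    """Each equality constraint on a facet narrows the user's context."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in facet_constraints.items())]

print([item["name"] for item in narrow(catalog, manufacturer="Northwind", color="red")])
# ['Trail Jacket']

# Facet counts suggest where to navigate next within the current context.
print(Counter(item["color"] for item in narrow(catalog, manufacturer="Northwind")))
```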

Information analysis depends on metrics—periodic, quantitative assessments of business processes—derived from business information. The challenge is to compare past, present, and predicted future metrics, looking to detect and understand significant trends and patterns.

Analysis and navigation tools, from simple reporting to data mining, need connectivity to information services and sources through various protocols and access frameworks. Examples are JDBC, ODBC, proprietary APIs, REST,7 and SOAP-based web service invocations. Service interfaces, adapters to information sources, and even “web page scrapers” can hide unnecessary details. For complex event processing applications, such as dashboards, arbitrage, and data center monitoring, the information sources to be analyzed may be near-real-time data streams. Mashup8 technologies make it easier for the next generation of knowledge workers to combine information in new ways. And visual representations of abstract information help users to explore and understand it. The challenge is not to present flashy visual effects but to show meaningful aspects of the data that amplify cognition and engage the pattern detection strengths of the human visual system.9 From timelines, radar charts, and treemaps to mindmaps, heat graphs, and mashups, visualizations can evoke the “aha!” response in a way that no spreadsheet or drab report can match.10
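
As a closing sketch, the code below pulls metrics from a hypothetical REST endpoint and joins them with local reference data before display; the URL and the JSON field names are assumptions, not a real service.

```python
# Pull near-real-time metrics from a hypothetical REST endpoint and join them
# with local reference data before display. The URL and JSON field names are
# assumptions, not a real service.
import json
import urllib.request

REGION_NAMES = {"NE": "Northeast", "SW": "Southwest"}       # local reference data

def fetch_metrics(url: str = "https://metrics.example.com/api/sales/latest") -> list:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)        # e.g. [{"region": "NE", "revenue": 1200000.0}, ...]

def dashboard_rows(metrics: list) -> list:
    """Mash up the remote metrics with friendly names for a dashboard view."""
    return [{"region": REGION_NAMES.get(m["region"], m["region"]),
             "revenue": m["revenue"]}
            for m in metrics]
```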

Conclusion

Sparked by a detailed study of use cases and scenarios for EIM, in this article I’ve distilled and probed five central technical challenges in delivering holistic information to business processes and users. Understanding EIM use cases and the technical challenges they present lays the groundwork for a later discussion of the components and services in the expanding EIM stack and a set of practical starting points for EIM.

Footnotes

1. Berners-Lee, Tim; Hendler, James; Lassila, Ora. "The Semantic Web," Scientific American, May, 2001.
2. World Wide Web Consortium (W3C), Semantic Web Activity
3. World Wide Web Consortium (W3C), OWL Web Ontology Language Overview, May 10, 2004
4. World Wide Web Consortium (W3C), Resource Description Framework (RDF), 2004
5. World Wide Web Consortium (W3C), Web Services Description Language (WSDL) Version 2.0 Part 0: Primer, 2007
6. Kopecky, Jacek; Vitvar, Tomas; Bournez, Carine; Farrell, Joel. “SAWSDL: Semantic Annotations for WSDL and XML Schema,” IEEE Internet Computing, November/December 2007, pp. 60-67.
7. Fielding, Roy; Taylor, Richard. “Principled Design of the Modern Web Architecture”, ACM Transactions on Internet Technology, May, 2002.
8. http://en.wikipedia.org/wiki/Mashup_%28web_application_hybrid%29
9. Few, Stephen. “A Ménage à Trois of Data, Eyes and Mind,” DM Review, January, 2008.
10. Lengler, Ralph; Eppler, Martin. A Periodic Table of Visualization Methods

ABOUT THE AUTHOR

Dave Reiner

Dr. Dave Reiner has a long history of innovation in database and information management technologies.  Within EMC’s CTO Office, Dave focuses on technical strategy for information virtualization and integration. He has been involved with adaptive performance tuning, cache optimization, data center management, query optimization, data mining, database marketing, and parallel computing technologies for 30 years. Dave holds multiple patents on parallel query and web mining algorithms, was Editor-in-Chief of IEEE Data Engineering, and co-edited Query Processing in Database Systems. Prior to EMC, Dave architected CRM software at NetGenesis and Fidelity Investments, was Chief Scientist for database marketing at Epsilon, invented parallel database technology at Kendall Square Research, and directed database research at Computer Corporation of America. Before completing his Ph.D. in Computer Science at University of Wisconsin-Madison, he taught high school math as a Peace Corps volunteer in Zaire. A champion fiddler and mandolin player, Dave has a series of instructional books and recorded CDs to his credit, and can be found playing bluegrass, Irish, and oldtime music on weekends with the Reiner Family Band.
