The term unstructured data has been used prolifically over the past several years, but has not been given a comprehensive, denotative definition. Like many other terms in data resource management, unstructured data is part of a large and growing lexical challenge. That lexical challenge must be resolved, and terms must be based on the valid meaning of roots, prefixes, and suffixes to have a formal data management profession.
The first issue to answer is the origin of the term unstructured data. The term originated from database technicians who were unable to store data in traditional tables, and retrieve those data with SQL (Structured Query Language). Since the operative term in SQL is structured, and the data could not be retrieved with SQL, then those data must obviously be unstructured. That term has persisted—with respect to SQL.
However, going to the dictionary shows that unstructured means not structured, having few formal requirements, or not having a patterned organization; without structure, having no structure, or structureless. Extending that definition to unstructured data means that unstructured data are not structured, have few formal requirements, or do not have a patterned organization. In other words, unstructured data are an amorphous mess that has no structure.
Looking for better terms to replace unstructured data led to non-tabular data, since those data could not be stored in or retrieved from tables. The term was meaningful to many people, but was criticized for stating what the data were not rather than stating what the data were. Looking further for a better term led to super-structured data.
Super means over and above, higher in quantity, quality or degree; exceeding a norm in excessive degree or intensity, surpassing all or most others of its kind; situated or placed above, on, or at the top of, situated on the dorsal side; having the ingredient present in a large or unusually large proportion; constituting a more inclusive category than that specified; superior in status, title, or position.
Structured means something arranged in a definite pattern or organization; manner of construction; the arrangement of particles or parts in a substrate or body, arrangement or interrelation of parts as dominated by the general character of the whole; the aggregate of elements of an entity in their relationships to each other, the composition of conscious experience with its elements and their combination.
Super-structured data are any data that are structured in a manner more intricate than tabular data and, therefore, cannot be retrieved by structured query languages or tools. Super-structured data can be analyzed to reduce that intricate structure to simpler structures for processing by structured languages and tools, or by other languages and tools.
Comments received about super-structured data showed that people had accepted the message that the term unstructured data was wrong and that a more appropriate term was needed. However, super-structured data was not the right term either, because it was easily confused with the term superstructure, meaning a vertical extension of something above a base, such as the superstructure of a battleship, which has no meaning with respect to data. What was interesting about the comments is that people used the dictionary to show that super-structured was the wrong term, yet those same people did not use the dictionary to show that unstructured was the proper term! Such an approach is typical of the lexical challenge in data resource management.
A better term was sought that represented an intricate interweaving of multiple structures. Complex means composed of two or more parts; having a bound form; hard to separate, analyze, or solve; a whole made up of complicated or interrelated parts; a composite made up of distinct parts; intricate as having many complexly interrelating parts or elements. Complex structured data are any data that are composed of two or more intricate, complicated, and interrelated parts that cannot be easily interpreted by structured query languages and tools. The complex structure needs to be broken down into the individual component structures to be more easily processed.
Next, the term semi-structured data had to be resolved. Semi-structured data was a loose term that represented a data structure between structured data and unstructured data. An excellent replacement term is highly structured data, which are any data that are more intricately structured than traditional tabular data, but are not as intricately structured as complex structured data. Therefore, a sequence is established for unstructured data, structured data, highly structured data, and complex structured data. That sequence seems to be well received.
Terms like poly-structured data and multi-structured data have been used, but with reference to database management systems. They are not used with reference to an organization’s entire data resource, including data within and without database management systems. Those terms are poor replacements for complex structured data.
Complex structured data include text, voice, video, images, spatial data, and so on. Complex structured data can be broken down into simpler structures that can be more easily stored and retrieved. For example, textual data are richly structured physically, grammatically, and semantically. Text can be analyzed by a variety of methods, either human or automated, to determine precedents and conclusions, who wrote the text based on analysis and comparison with an author’s known writing, and so on. Each of these parameters can be documented as structured data to provide insight into the text.
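As a minimal sketch of documenting parameters of a text as structured data, the Python function below reduces a passage to one flat record. The function name describe_text, the chosen parameters, and the simple rules for splitting sentences and words are illustrative assumptions. They cover only a few physical properties of the text and are far cruder than the grammatical or semantic analysis described above.

```python
import re


def describe_text(text: str) -> dict:
    """Reduce a passage of text to a flat record of simple parameters.

    Hypothetical helper for illustration only: sentences are split on
    whitespace that follows ., ! or ?, and words are runs of letters
    and apostrophes.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "distinct_word_count": len(set(words)),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }
```

Each returned record has a fixed set of named values, so it could be stored as one row of a traditional table alongside the original text.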
Similarly, voice data are textual data with intonations and inflections in the voice, which can be analyzed. That approach is the basis for the psychological stress evaluator (PSE) often used in law enforcement. Video data are text, plus voice, plus body movements that can also be analyzed for body language, such as eye movement, eyelid blinking, mouth movements, arms, posture, and so on, to determine a person’s true feelings. Video analysis has a wide range of uses, from job interviews, to legislative testimony, to jury selection.
A similar example is a polygraph examination to determine if someone is truthful or is showing deception. A polygraph examination consists of a series of questions asked by the examiner. The respondent answers either Yes or No to each question. The blood pressure, pulse, perspiration, respiration, and so on, are measured during each question and answer to provide an indication of truth or deception. The results are analyzed at the end of the examination to determine if the respondent is truthful or deceptive.
The complex structured data from a polygraph examination could be broken down to simpler structures. A polygraph examination has an examiner, a respondent, date, time, location, machine used, machine calibration, polygraph strip, and so on. Each polygraph examination has many questions, each of which includes the question, the sequence of the question, the time of the question, and so on. Each question has a series of responses, such as verbal, blood pressure, pulse, perspiration, respiration, and so on. The results of each question are evaluated to provide an indication of truth or deception for the question. The results of all questions are evaluated for an indication of truth or deception for the entire polygraph examination.
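The breakdown described above can be sketched as three nested record types: an examination, its questions, and the responses to each question. The Python below is only an illustration under stated assumptions. The class names, field names, and the response_rows helper are hypothetical, only some of the listed attributes are shown, and the indications of truth or deception are stored as values recorded by the examiner rather than computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Response:
    kind: str   # e.g. "verbal", "blood pressure", "pulse", "respiration"
    value: str  # measured or spoken value, kept as text in this sketch


@dataclass
class Question:
    sequence: int                      # position of the question in the exam
    text: str                          # the question as asked
    asked_at: str                      # time of the question
    responses: List[Response] = field(default_factory=list)
    indication: Optional[str] = None   # examiner's result for this question


@dataclass
class PolygraphExamination:
    examiner: str
    respondent: str
    date: str
    location: str
    machine: str
    questions: List[Question] = field(default_factory=list)
    overall_indication: Optional[str] = None  # result for the whole exam


def response_rows(exam: PolygraphExamination) -> List[Tuple[str, int, str, str]]:
    """Flatten the nested examination into simple tabular rows,
    one row per response: (respondent, question sequence, kind, value)."""
    return [
        (exam.respondent, q.sequence, r.kind, r.value)
        for q in exam.questions
        for r in q.responses
    ]
```

Once flattened this way, each row has the same fixed set of values, which is the kind of simpler structure that traditional tables and structured query languages can store and retrieve.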
The same situation exists with computed tomography (CT) scans and magnetic resonance imaging (MRI), where the data can be combined into three-dimensional views and then analyzed for diagnosis and treatment of diseases. Spatial data (points, lines, areas, and volumes), images, and other forms of highly structured and complex structured data can be analyzed in a similar manner. The analysis can be manual or automated, and can be done for operational, analytical, or predictive data. Therefore, highly structured and complex structured data can be managed as a set of sub-structures that comprise the complex structure.
I recently saw an announcement for a presentation on correlating structured and unstructured data. I thought it interesting that anyone was even trying to correlate structured data and unstructured data, and that unstructured data were even considered for correlation to anything.
Use of the terms structured data, highly structured data, and complex structured data helps resolve the lexical challenge and begins providing meaningful terms. People frequently ask what is perpetuating the lexical challenge in data resource management. The answer is that people are simply pumping the words without realizing what they are saying or what the words really mean. Data management professionals must stop pumping words and terms without understanding their true meaning, and start using words and terms that have a comprehensive and denotative meaning based on roots, prefixes, and suffixes as defined in the dictionary. That’s the only way to resolve the lexical challenge and promote a formal data management profession.