Unstructured data
Unstructured data (or unstructured information) is information that either does not have a pre-defined
In 1998,
As of 2012,
Background
The earliest research into
Issues with terminology
The term is imprecise for several reasons:
- Structure, while not formally defined, can still be implied.
- Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
- Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.
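An e-mail message illustrates these points: its headers follow a defined format, while its body is free text. The following is a minimal sketch using Python's standard `email` module. The sample message, the addresses, and the function name `split_message` are hypothetical and serve only as illustration.

```python
from email import message_from_string

# Hypothetical sample message: structured headers, free-text body.
RAW_MESSAGE = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the report is attached. Revenue rose 4% in Q3.
"""

def split_message(raw_text):
    """Separate the structured part (headers) from the unstructured part (body)."""
    msg = message_from_string(raw_text)
    # Header names are case-insensitive, so normalize them to lower case.
    headers = {name.lower(): value for name, value in msg.items()}
    body = msg.get_payload().strip()
    return headers, body

headers, body = split_message(RAW_MESSAGE)
print(headers["subject"])
print(body)
```

The headers can be queried like database fields, whereas the fact that revenue rose in the third quarter remains buried in the body until further processing extracts it.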
Dealing with unstructured data
Techniques such as
Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exists in all forms of human communication.
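As a simple illustration of deriving machine-processable structure from text, the sketch below assumes a rule-based approach built on regular expressions. Practical systems typically rely on much richer linguistic analysis. The patterns, the function name `extract_fields`, and the sample note are illustrative assumptions.

```python
import re

# Simplified patterns, chosen for illustration only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")            # ISO-style dates
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")  # basic e-mail addresses

def extract_fields(text):
    """Turn free text into a small record of typed fields."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "emails": EMAIL_PATTERN.findall(text),
    }

note = "Met Dana on 2016-06-24; follow up via dana@example.org before 2016-07-01."
record = extract_fields(note)
print(record)
```

The resulting record can be stored and queried like any other structured data, although everything the patterns do not anticipate is lost.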
Since unstructured data commonly occurs in
Approaches in natural language processing
Specific computational workflows have been developed to impose structure on the unstructured data contained in text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual annotation could handle. Several of these approaches are based on the concept of online analytical processing (OLAP) and may be supported by data models such as text cubes.[14] Once document metadata is available through a data model, summaries of subsets of documents (i.e., cells within a text cube) can be generated with phrase-based approaches.[15]
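The idea of a text cube can be sketched as follows: documents are grouped into cells by their metadata dimensions, and each cell is summarized by phrases that are common inside it. This is a toy illustration, not the published algorithms. The sample documents, the use of word bigrams as phrases, and the ranking rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical documents with two metadata dimensions (topic, year).
DOCUMENTS = [
    {"topic": "cardiology", "year": 2015, "text": "heart failure risk rises with age"},
    {"topic": "cardiology", "year": 2015, "text": "new therapy reduces heart failure admissions"},
    {"topic": "oncology", "year": 2015, "text": "new therapy slows tumor growth"},
    {"topic": "oncology", "year": 2016, "text": "tumor growth linked to gene variant"},
]

def bigrams(text):
    """Return the set of two-word phrases in a text."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def build_cube(documents, dimensions):
    """Group documents into cells keyed by their values on the chosen dimensions."""
    cube = defaultdict(list)
    for doc in documents:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].append(doc)
    return cube

def summarize_cell(cube, cell, top_n=1):
    """Rank phrases by how many documents in the cell contain them.

    Ties are broken in favor of phrases that are rarer outside the cell,
    then alphabetically.
    """
    in_cell, overall = Counter(), Counter()
    for key, docs in cube.items():
        for doc in docs:
            phrases = bigrams(doc["text"])
            overall.update(phrases)
            if key == cell:
                in_cell.update(phrases)
    ranked = sorted(in_cell, key=lambda p: (-in_cell[p], overall[p] - in_cell[p], p))
    return ranked[:top_n]

cube = build_cube(DOCUMENTS, ("topic",))
print(summarize_cell(cube, ("cardiology",)))
print(summarize_cell(cube, ("oncology",)))
```

Passing `("topic", "year")` as the dimensions yields finer cells, which corresponds to drilling down in OLAP terms.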
Approaches in medicine and biomedical research
Biomedical research is a major source of unstructured data, as researchers often publish their findings in scholarly journals. Deriving structural elements from the language of these documents is challenging (e.g., because of their complicated technical vocabulary and the domain knowledge required to fully contextualize observations), but doing so may yield links between technical and medical studies[16] and clues regarding new disease therapies.[17] Recent efforts to impose structure on biomedical documents include self-organizing map approaches for identifying topics among documents,[18] general-purpose unsupervised algorithms,[19] and an application of the CaseOLAP workflow[15] to determine associations between protein names and cardiovascular disease topics in the literature.[20] CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. The platform gives the biomedical community access to phrase-mining tools for a wide range of research applications.[20]
The use of "unstructured" in data privacy regulations
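In Sweden (an EU member state), before 2018, some data privacy regulations did not apply if the data in question was confirmed as "unstructured". The GDPR, applicable since 2018, instead frames its scope in terms of a "filing system":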
- Parts of GDPR Recital 15, "The protection of natural persons should apply to the processing of personal data ... if ... contained in a filing system."
- GDPR Article 4, "‘filing system’ means any structured set of personal data which are accessible according to specific criteria ..."
GDPR case law on what defines a "filing system": "the specific criterion and the specific form in which the set of personal data collected by each of the members who engage in preaching is actually structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific person who has been contacted to be easily retrieved, which is however for the referring court to ascertain in the light of all the circumstances of the case in the main proceedings." (CJEU, Tietosuojavaltuutettu v. Jehovan todistajat, paragraph 61).
If personal data can be easily retrieved, it constitutes a filing system and is therefore in scope for the GDPR, regardless of whether it is "structured" or "unstructured". Most electronic systems today, depending on access rights and the software applied, allow for easy retrieval of data.
Notes
- Yuhanna, Noel (Nov 2010). "Today's Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn't An Option". Forrester Research.
References
- Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals" (PDF). Merrill Lynch. Archived from the original (PDF) on 24 July 2011.
- Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis - Bridgepoints. Clarabridge.
- ISSN 0268-4012.
- "The biggest data challenges that you might not even know you have - Watson". Watson. 2016-05-25. Retrieved 2018-10-02.
- "Structured vs. Unstructured Data". www.datamation.com. Retrieved 2018-10-02.
- "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". www.emc.com. EMC Corporation. December 2012.
- "Trends | Seagate US". Seagate.com. Retrieved 2018-10-01.
- Grimes, Seth. "A Brief History of Text Analytics". B Eye Network. Retrieved June 24, 2016.
- Albright, Russ. "Taming Text with the SVD" (PDF). SAS. Archived from the original (PDF) on 2016-09-30. Retrieved June 24, 2016.
- Desai, Manish (2009-08-09). "Applications of Text Analytics". My Business Analytics @ Blogspot. Retrieved June 24, 2016.
- Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining" (PDF). SAS. Retrieved June 24, 2016.
- "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?". InformationWeek. March 1, 2005.
- Malone, Robert (April 5, 2007). "Structuring Unstructured Data". Forbes.
- S2CID 1522480.
- Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (PDF).
- S2CID 31449783.
- PMID 26420781.
- PMID 23554924.
- PMID 25411329.
- PMID 29775406.
- "Swedish data privacy regulations discontinue separation of "unstructured" and "structured"".