Text mining

Source: Wikipedia, the free encyclopedia.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality

named entities
).

Text analysis involves

visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via the application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.
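
As a minimal sketch of what turning text into data can look like, the following Python snippet tokenizes a few documents and builds a document-term count matrix. The tokenizer pattern, the sample documents, and the function names are illustrative assumptions, not part of any particular text mining system.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical sample documents, used only for illustration.
docs = ["Text mining turns text into data.", "Data drives analysis."]
vocab, matrix = document_term_matrix(docs)
```

Each row of the matrix is a numeric representation of one document, which is the kind of structure that statistical or machine learning methods can then work on.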

A typical application is to scan a set of documents written in a

predictive classification purposes or populate a database or search index with the information extracted. The document is the basic element when starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.[3]
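
Populating a search index from a set of documents can be sketched as follows. This is a simplified illustration that assumes documents are plain strings keyed by a hypothetical identifier; real indexes usually add steps such as stemming and stop-word removal.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical two-document collection, used only for illustration.
corpus = {
    "d1": "Protein interactions in yeast",
    "d2": "Text mining of protein literature",
}
index = build_inverted_index(corpus)
```

Looking up `index["protein"]` then returns the identifiers of every document that mentions the word, without rescanning the text.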

Text analytics

Text analytics describes a set of

statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[4] The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of "text mining"[5] in 2004 to describe "text analytics".[6] The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s,[7] notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[8] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

Text analysis processes

Subtasks—components of a larger text-analytics effort—typically include:

Applications

Text mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and for searching documents relevant to their daily activities. Legal professionals may use text mining for

ad placement
, among numerous other activities.

Security applications

Many text mining software packages are marketed for

decryption
.

Biomedical applications

protein docking.[18]

A range of text mining applications in the biomedical literature has been described,

protein interactions,[21][22] and protein-disease associations.[23] In addition, with large patient textual datasets in the clinical field, datasets of demographic information in population studies and adverse event reports, text mining can facilitate clinical studies and precision medicine. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests.[24] One online text mining application in the biomedical literature is PubGene, a publicly accessible search engine that combines biomedical text mining with network visualization.[25][26] GoPubMed is a knowledge-based search engine for biomedical texts. Text mining techniques also enable us to extract unknown knowledge from unstructured documents in the clinical domain.[27]

Software applications

Text mining methods and software are also being researched and developed by major firms, including

Weka software is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called NLTK for more general purposes. For more advanced programmers, there is also the Gensim library, which focuses on word embedding-based text representations.

Online media applications

Text mining is being used by large media companies, such as the

Tribune Company
, to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Business and marketing applications

Text analytics is being used in business, particularly in marketing, such as in customer relationship management.[29] Coussement and Van den Poel (2008)[30][31] apply it to improve predictive analytics models for customer churn (customer attrition).[30] Text mining is also being applied in stock returns prediction.[32]

Sentiment analysis

Sentiment analysis may involve analysis of products such as movies, books, or hotel reviews for estimating how favorable a review is for the product.[33] Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for affectivity of words and concepts have been made for

ConceptNet,[35]
respectively.
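
A lexicon-based approach to estimating how favorable a review is can be sketched as below. The word scores here are a made-up toy lexicon, not values from any published affective resource, and the averaging rule is just one simple assumption about how to combine them.

```python
import re

# Hypothetical toy lexicon: word -> affect score in [-1, 1].
TOY_LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "terrible": -1.0}

def review_score(text, lexicon=TOY_LEXICON):
    """Average the lexicon scores of the words found; 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

A sketch like this ignores negation ("not good") and context, which is one reason labeled data sets are often needed in practice.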

Text has been used to detect emotions in the related area of affective computing.[36] Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.

Scientific literature mining and academic applications

The issue of text mining is of importance to publishers who hold large

Document Type Definition
(DTD) that would provide semantic cues to machines to answer specific queries contained within the text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

Methods for scientific literature mining

Computational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching,[40] determining novelty,[41] and clarifying homonyms[42] among technical reports.

Digital humanities and computational sociology

The automatic analysis of vast textual corpora has created the possibility for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning.

Narrative network of US Elections 2012[43]

The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed by using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes.

subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.[43]
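
The step from triplets to a network can be illustrated with a short sketch. It assumes the subject-verb-object triplets have already been extracted by a parser (that step is not shown), treats links as undirected, and uses degree centrality as one simple way to rank actors; the actor names are hypothetical.

```python
from collections import defaultdict

def build_network(triplets):
    """Build an undirected actor network: node -> set of neighbouring nodes."""
    neighbours = defaultdict(set)
    for subject, _verb, obj in triplets:
        if subject != obj:
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    return neighbours

def degree_centrality(neighbours):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical triplets, standing in for parser output.
triplets = [
    ("Senator", "criticizes", "Governor"),
    ("Governor", "supports", "Mayor"),
    ("Senator", "meets", "Mayor"),
    ("Mayor", "endorses", "Union"),
]
centrality = degree_centrality(build_network(triplets))
```

In this toy network the node with the highest centrality is the one linked to every other actor; on real corpora the same idea is applied to networks with thousands of nodes.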

Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents.[46][47][48][49][50] The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al.[51] showing how different topics have different gender biases and levels of readability; the possibility to detect mood patterns in a vast population by analyzing Twitter content was demonstrated as well.[52][53]

Software

Text mining computer programs are available from many commercial and open source companies and sources.

Intellectual property law

Situation in Europe

Video by the Fix Copyright campaign explaining TDM and its copyright issues in the EU, 2016

Under

Information Society Directive
(2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions.

The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.[55] Because the proposed solution to this legal issue focused on licences rather than on limitations and exceptions to copyright law, representatives of universities, researchers, libraries, civil society groups and open access publishers left the stakeholder dialogue in May 2013.[56]

Situation in the United States

Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed—one such use being text and data mining.[57]

Situation in Australia

There is no exception in

Copyright Act 1968. The Australian Law Reform Commission has noted that it is unlikely that the "research and study" fair dealing exception would extend to cover such a topic either, given it would be beyond the "reasonable portion" requirement.[58]

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a

spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.
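
How a spam filter might learn the characteristics of unwanted messages can be sketched with a small naive Bayes scorer. This is an illustrative assumption about one common technique, not a description of any specific product; the training messages are invented, and the code assumes at least one example of each label.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_messages):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in labelled_messages:
        word_counts[label].update(tokens(text))
        label_counts[label] += 1
    return word_counts, label_counts

def spam_log_odds(text, word_counts, label_counts):
    """Log-odds that text is spam, using add-one smoothed word probabilities."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    totals = {lab: sum(word_counts[lab].values()) for lab in word_counts}
    score = math.log(label_counts["spam"] / label_counts["ham"])
    for word in tokens(text):
        p_spam = (word_counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training messages, used only for illustration.
training = [
    ("Win a free prize now", "spam"),
    ("Free offer, click now", "spam"),
    ("Meeting notes attached", "ham"),
    ("Lunch at noon tomorrow", "ham"),
]
model = train(training)
```

A positive value from `spam_log_odds("free prize", *model)` suggests the message looks more like the spam examples than the legitimate ones; a negative value suggests the opposite.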

References

Citations

  1. ^ "Marti Hearst: What is Text Mining?".
  2. ^ Hotho, A., Nürnberger, A. and Paaß, G. (2005). "A brief survey of text mining". In Ldv Forum, Vol. 20(1), p. 19-62
  3. ^ Feldman, R. and Sanger, J. (2007). The text mining handbook. Cambridge University Press. New York
  4. ^ [1] Archived November 29, 2009, at the Wayback Machine
  5. ^ "KDD-2000 Workshop on Text Mining – Call for Papers". Cs.cmu.edu. Retrieved 2015-02-23.
  6. ^ [2] Archived March 3, 2012, at the Wayback Machine
  7. S2CID 6433117.
  8. ^ "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis. August 2008. Retrieved 2015-02-23.
  9. .
  10. .
  11. .
  12. .
  13. .
  14. .
  15. .
  16. ^ "Sentiment Analysis in Twitter < SemEval-2017 Task 4". alt.qcri.org. Retrieved 2018-10-02.
  17. .
  18. .
  19. .
  20. .
  21. .
  22. .
  23. .
  24. .
  25. .
  26. .
  27. .
  28. ^ [3] Archived October 4, 2013, at the Wayback Machine
  29. ^ "Text Analytics". Medallia. Retrieved 2015-02-23.
  30. ^ .
  31. .
  32. .
  33. .
  34. ^ Alessandro Valitutti; Carlo Strapparava; Oliviero Stock (2005). "Developing Affective Lexical Resources" (PDF). PsychNology Journal. 2 (1): 61–83.
  35. ^ Erik Cambria; Robert Speer; Catherine Havasi; Amir Hussain (2010). "SenticNet: a Publicly Available Semantic Resource for Opinion Mining" (PDF). Proceedings of AAAI CSK. pp. 14–18.
  36. S2CID 753606.
  37. ^ "The University of Manchester". Manchester.ac.uk. Retrieved 2015-02-23.
  38. ^ "Tsujii Laboratory". Tsujii.is.s.u-tokyo.ac.jp. Archived from the original on 2012-03-07. Retrieved 2015-02-23.
  39. ^ "The University of Tokyo". UTokyo. Retrieved 2015-02-23.
  40. S2CID 13748283.
  41. .
  42. .
  43. ^ a b Automated analysis of the US presidential elections using Big Data and network analysis; S Sudhahar, GA Veltri, N Cristianini; Big Data & Society 2 (1), 1-28, 2015
  44. ^ Network analysis of narrative content in large corpora; S Sudhahar, G De Fazio, R Franzosi, N Cristianini; Natural Language Engineering, 1-32, 2013
  45. ^ Quantitative Narrative Analysis; Roberto Franzosi; Emory University © 2010
  46. PMID 28069962.
  47. ^ I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, The Structure of EU Mediasphere, PLoS ONE, Vol. 5(12), pp. e14243, 2010.
  48. ^ Nowcasting Events from the Social Web with Statistical Learning V Lampos, N Cristianini; ACM Transactions on Intelligent Systems and Technology (TIST) 3 (4), 72
  49. ^ NOAM: news outlets analysis and monitoring system; I Flaounas, O Ali, M Turchi, T Snowsill, F Nicart, T De Bie, N Cristianini Proc. of the 2011 ACM SIGMOD international conference on Management of data
  50. ^ Automatic discovery of patterns in media content, N Cristianini, Combinatorial Pattern Matching, 2-13, 2011
  51. ^ I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, RESEARCH METHODS IN THE AGE OF DIGITAL JOURNALISM, Digital Journalism, Routledge, 2012
  52. ^ Circadian Mood Variations in Twitter Content; Fabon Dzogang, Stafford Lightman, Nello Cristianini. Brain and Neuroscience Advances, 1, 2398212817744501.
  53. ^ Effects of the Recession on Public Mood in the UK; T Lansdall-Welfare, V Lampos, N Cristianini; Mining Social Network Dynamics (MSND) session on Social Media Applications
  54. ^ Researchers given data mining right under new UK copyright laws Archived June 9, 2014, at the Wayback Machine
  55. ^ "Licences for Europe – Structured Stakeholder Dialogue 2013". European Commission. Retrieved 14 November 2014.
  56. ^ "Text and Data Mining:Its importance and the need for change in Europe". Association of European Research Libraries. 2013-04-25. Archived from the original on 2014-11-29. Retrieved 14 November 2014.
  57. ^ "Judge grants summary judgment in favor of Google Books — a fair use victory". Lexology. Antonelli Law Ltd. 19 November 2013. Retrieved 14 November 2014.
  58. ^ "Text and data mining". Australian Law Reform Commission. 4 June 2013. Retrieved 10 February 2023.
