Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

What is Text & Data Mining?

Data mining is a research technique that uses computational analysis to uncover patterns in large data sets. Data mining techniques range from machine learning applications to GIS and mapping to business intelligence. The wide range of data types involved makes data mining techniques hard to pin down.

Text mining is the process of deriving information from textual data. Text mining techniques might include sentiment analysis, network analysis, word frequency distributions, pattern recognition, tagging/annotation, information extraction, and the production of granular taxonomies or ontologies.
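As a rough illustration of one technique named above, a word frequency distribution, here is a minimal Python sketch. The function name `word_frequencies`, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for this example; a real project would likely use a fuller tokenizer and remove stop words.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    """Return the top_n most common lowercase word tokens in text.

    Tokenization here is a deliberately simple assumption: runs of
    letters and apostrophes count as words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

# Made-up sample text, for illustration only.
sample = "Text mining derives information from text. Mining text reveals patterns in text."
print(word_frequencies(sample, top_n=2))
```

The same counts can feed later steps, such as comparing vocabulary across documents or over time.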

This kind of analytic tool is useful in numerous scholarly fields, from the humanities to the sciences, where useful data can be "mined" from large non-text datasets and from text databases of the published literature (Source: UMass Amherst Libraries).

Questions? Contact us by emailing data.services@nyu.edu, or fill out our consultation request form and we'll get back to you.

Why TDM?

Why do TDM?

TDM can help reveal new patterns or information in a large body of work, leading to the development of new knowledge or a larger evidence base for practice. TDM enables researchers to analyze thousands of documents and terabytes of data, allowing for a comprehensive look into research questions.

What kinds of questions can TDM answer?

TDM can help answer such a wide variety of questions that it would be hard to list them all! Some potential use cases are listed below, but do consult the literature of your field to see how TDM is being used to answer questions in your domain.

  • One of the most common uses of text and data mining is in the analysis of social media data. Social media offers a wealth of structured and unstructured data, from which scholars of many kinds can derive interesting information. One might ask, "What do people of XYZ demographic think of ZYX issue?" One way to approach this question is to mine Twitter and run a sentiment analysis.
  • You might also ask, "Are men cited in [domain of study] more than women?" and approach the question by mining bibliographies from papers in a given domain and looking at the gender of cited authors in bulk to get some statistics.
  • You might be in data science and need a lot of data to train a classifier for a machine learning project. Mining data from an applicable source (e.g. Flickr for images, scholarly articles from a database, etc.) can provide a lot of training data for a model.
  • You might want to examine government records of financial transactions to see if there are any shady donations from lobbyists to government actors. By examining these documents in bulk with TDM, you can find outliers or illuminating patterns in financial transactions.
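To make the last use case a little more concrete, here is a minimal Python sketch of one way to flag outliers in a list of donation amounts. It uses a modified z-score based on the median absolute deviation, which is one common choice among many rather than a prescribed TDM method. The function name `flag_outliers`, the threshold of 3.5, and the donation figures are all assumptions invented for illustration.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is
    the median absolute deviation from the median.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

# Made-up donation amounts, for illustration only.
donations = [250, 300, 275, 320, 290, 310, 50000]
print(flag_outliers(donations))  # prints [50000]
```

A flagged value is only a starting point for closer reading of the underlying records, not evidence of wrongdoing on its own.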

Examples of TDM Projects