Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

NYU Libraries provides support for all phases of your text mining and data mining work, from corpus selection, cleaning, and parsing to analysis and visualization. Support is offered through several departments in the library, including Data Services, Digital Scholarship Services, and Digital Library Technology Services. We can advise you through consultations and help you implement your research questions in software, but you are responsible for developing the methodological questions and framework.

Questions? Email us at data.services@nyu.edu or fill out our consultation request form.

Why TDM?

Why do TDM?

TDM can help reveal new patterns or information in a large body of work, leading to the development of new knowledge or a broader evidence-based practice. TDM enables researchers to analyze thousands of documents and terabytes of data, allowing for a comprehensive look into research questions.

What kinds of questions can TDM answer?

TDM can help answer such a wide variety of questions that it would be hard to list them all! Some potential use cases are listed below, but do consult the literature of your field to see how TDM is being used to answer the questions of your domain.

  • One of the most common uses of text and data mining is in the analysis of social media data. Social media has a bevy of structured and unstructured data, from which scholars of many kinds can derive interesting information. One might ask, "What do people of XYZ demographic think of ZYX issue?" One way to approach this question is to mine Twitter and do a sentiment analysis.
  • You might also ask, "Are men cited in [domain of study] more than women?" and approach the question by mining bibliographies from papers in a given domain, then looking at the gender of the cited authors in bulk to produce some statistics.
  • You might be in data science and need a lot of data to train a classifier for a machine learning project. Mining data from an applicable source (e.g. Flickr for images, scholarly articles from a database, etc.) can provide a large amount of training data for a model.
  • You might want to examine government records of financial transactions to see if there are any questionable donations from lobbyists to government actors. By examining these documents in bulk with TDM, you can find outliers or illuminating patterns in financial transactions.
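
To make the first use case above a little more concrete, here is a minimal sketch of a lexicon-based sentiment analysis in Python. It assumes you have already collected posts as plain strings. The word lists and the function names (`tokenize`, `sentiment_label`, `summarize`) are hypothetical and chosen only for illustration; they are not part of any library.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions for this sketch only).
# A real study would use a validated lexicon or a trained model.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_label(text):
    """Label a post by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(posts):
    """Count how many posts fall under each sentiment label."""
    return Counter(sentiment_label(post) for post in posts)


if __name__ == "__main__":
    sample_posts = [
        "I love this policy, it is great",
        "This is a terrible idea",
        "The vote is on Tuesday",
    ]
    print(summarize(sample_posts))
```

Counting words from a fixed list is the simplest possible approach and misses negation, sarcasm, and context, so treat it as a starting point for discussion rather than a method to use as-is.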