Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

General Information

Service Desk and Chat

 Bobst Library, 5th floor

 Staffed Hours: Fall 2023
   Mondays:     12pm - 5pm
   Tuesdays:    12pm - 5pm
   Wednesdays:  12pm - 5pm
   Thursdays:   12pm - 5pm
   Fridays:     12pm - 5pm

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Please attribute this work to the NYU Libraries Scholarly Communications and Information Policy Department.

NYU Libraries provides support for all phases of your text and data mining work, from corpus selection, cleaning, and parsing to analysis and visualization. Support is offered through several library departments, including Data Services, Digital Scholarship Services, and Digital Library Technology Services. We can advise you through consultations and help you implement your research questions in software, but you are responsible for developing the research questions and the methodological framework.

Questions? Contact us by email or fill out our consultation request form.

Why do TDM?

Text and data mining (TDM) can help reveal new patterns or information in a large body of work, leading to new knowledge and a broader base of evidence for practice. TDM enables researchers to analyze thousands of documents and terabytes of data, allowing for a comprehensive look into research questions.

What kinds of questions can TDM answer?

TDM can help answer such a variety of questions that it would be hard to list them all. Some potential use cases are listed below, but do consult the literature of your field to see how TDM is being used to answer questions in your domain.

  • One of the most common uses of text and data mining is the analysis of social media data. Social media offers a wealth of structured and unstructured data from which scholars of many kinds can derive interesting information. One might ask, "What do people of XYZ demographic think of ZYX issue?" One way to approach this question is to mine Twitter and run a sentiment analysis on the collected posts.
  • You might also ask, "Are men cited in [domain of study] more than women?" and approach the question by mining bibliographies from papers in a given domain, then examining the gender of the cited authors in bulk to produce statistics.
  • You might be in data science and need a lot of data to train a classifier for a machine learning project. Mining data from an applicable source (e.g. Flickr for images, scholarly articles from a database, etc.) can provide a large amount of training data for a model.
  • You might want to examine government records of financial transactions to see if there are any questionable donations from lobbyists to government actors. By examining these documents in bulk with TDM, you can find outliers or illuminating patterns in the transactions.
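
To make the sentiment analysis use case above more concrete, here is a minimal Python sketch of one simple approach: counting positive and negative words from a lexicon. It is an illustration under stated assumptions, not a recommended method. The word lists, function names, and sample posts are all hypothetical, and a real project would likely rely on a published lexicon or a trained model and on posts collected in line with the platform's terms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only;
# a real study would use a published sentiment lexicon.
POSITIVE = {"good", "great", "love", "support", "helpful"}
NEGATIVE = {"bad", "awful", "hate", "oppose", "harmful"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return the number of positive words minus the number of negative words."""
    tokens = tokenize(text)
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive - negative


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(posts):
    """Count how many posts fall under each sentiment label."""
    return Counter(label(sentiment_score(post)) for post in posts)


# Made-up example posts; no real data is used here.
posts = [
    "I love this policy, it is great",
    "This is a bad and harmful idea",
    "The vote is scheduled for Tuesday",
]
summary = summarize(posts)
print(dict(summary))
```

A word-count approach like this ignores negation, sarcasm, and context, so treat it as a starting point for thinking about the method rather than a finished analysis.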