spaCy is an industrial-strength natural language processing library and a leading way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, Keras, scikit-learn, Gensim, and the rest of Python's AI ecosystem.
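As a minimal sketch of what text preparation with spaCy looks like (assuming the small English pipeline en_core_web_sm has been installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

doc = nlp("NYU Libraries supports text analysis with open-source tools.")

# Each token carries a lemma, part-of-speech tag, and other annotations
# that can feed into downstream machine learning pipelines.
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```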
NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
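A minimal sketch of NLTK's tokenization, stemming, and tagging interfaces (the data-package names below are assumptions that may vary slightly between NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import pos_tag

# One-time downloads of tokenizer models and tagger data.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("NLTK provides easy-to-use interfaces to over 50 corpora.")
stems = [PorterStemmer().stem(t) for t in tokens]
tagged = pos_tag(tokens)

print(tokens)
print(stems)
print(tagged)
```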
gensim is a Python library for topic modeling. It is a robust, efficient, and hassle-free piece of software for unsupervised semantic modeling from plain text.
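A minimal sketch of unsupervised topic modeling with gensim's LDA implementation (the toy documents are purely illustrative):

```python
from gensim import corpora, models

# Toy tokenized documents used only for illustration.
texts = [
    ["human", "computer", "interaction", "interface"],
    ["graph", "trees", "minors", "survey"],
    ["computer", "system", "response", "time"],
]

# Map tokens to integer ids and build a bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model and inspect the discovered topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```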
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
JGAP is a web-based platform for designing and implementing Java-coded graph algorithms. The platform contains a library of common data structures for implementing graph algorithms, features a "plug-and-play" modular design for adding new algorithm modules, and includes a performance meter to measure the execution time of implemented algorithms.
HathiTrust Digital Library is a digital preservation repository and highly functional access platform. It provides long-term preservation and access services for public-domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Bibliographic and full-text search are available for all volumes in HathiTrust. Public-domain volumes are freely accessible to the public and can be downloaded in their entirety, with authentication, by persons affiliated with partner institutions. NYU Libraries is a HathiTrust partner institution.
Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
The English-Corpora.org online version comprises several corpora, including: iWeb (the Intelligent Web Corpus); NOW (News on the Web); the Coronavirus Corpus; COCA (Corpus of Contemporary American English); GloWbE (Global Web-based English); the Wikipedia Corpus; COHA (Corpus of Historical American English); the TV Corpus; the Movies Corpus; the SOAP Corpus; as well as the Corpus del Español and the Corpus do Português. The corpora have many different uses, including: finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (e.g., across periods, dialects, and genres); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. To access the corpora as a downloadable set for offline use, see the resource "English-Corpora Text-as-Data."
Users must create an account with English-Corpora.org using their NYU email addresses. Users must also connect using this link at least once every 365 days to retain their account's access.
ProQuest TDM Studio gives research teams at NYU the ability to mine large volumes of published content from the millions of pages of news, scholarly, and other publications provided to the campus community through current ProQuest subscriptions.
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations, and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. LDC creates and distributes a wide array of language resources, including materials used by those engaged in language-related education, research, and technology development. Spanning data collections, corpora, software, research papers, and specifications, these vital tools aid and inspire scientific progress. Please note that NYU does not currently own all products available from LDC.
To access LDC data, you must be identified as a member of the NYU community. To get started, go to the account homepage and create an account using your NYU email address. Be sure to enter "New York University" (without the quotes) in the "Organization" box and then select it as your organization. After you create your account, you will be contacted with next steps.
Search America's historic newspaper pages from 1789 to 1925, or use the U.S. Newspaper Directory to find information about American newspapers published from 1690 to the present.
Project Gutenberg is a repository of ebooks that can be downloaded as plain text. You will find the world's great literature here, especially older works for which copyright has expired.
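A minimal sketch of fetching a public-domain text from Project Gutenberg with Python's standard library (the URL below follows Gutenberg's plain-text pattern but is only an example; copy the exact link from the book's download page):

```python
from urllib.request import urlopen

# Example URL for a plain-text ebook; the exact path varies by title,
# so take it from the book's "Plain Text UTF-8" download link.
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"

with urlopen(url) as response:
    text = response.read().decode("utf-8")

print(text[:500])  # first 500 characters of the downloaded book
```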
These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets). Each of the numbered links below will directly download a fragment of the corpus. In Version 2 the ngrams are grouped alphabetically (languages with non-Latin scripts were transliterated); in Version 1 the ngrams are partitioned into files of equal size. In addition, for each corpus we provide a file named total_counts, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of ngrams.
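As a rough sketch of how the total_counts file can be used to turn raw ngram counts into relative frequencies (the column layouts described in the comments are assumptions; check the dataset documentation for the exact format of the files you download):

```python
import csv

# Assumed layout of a Version 2 ngram fragment: one tab-separated record per
# line of the form  ngram <TAB> year <TAB> match_count <TAB> volume_count.
# Assumed layout of total_counts: whitespace-separated records of the form
# year,match_count,page_count,volume_count.

def load_total_counts(path):
    """Map year -> total number of 1-grams in that year's books."""
    totals = {}
    with open(path, encoding="utf-8") as f:
        for record in f.read().split():
            year, match_count, _pages, _volumes = record.split(",")
            totals[int(year)] = int(match_count)
    return totals

def relative_frequencies(ngram_path, totals, target):
    """Yield (year, relative frequency) pairs for a target ngram."""
    with open(ngram_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            ngram, year, match_count = row[0], int(row[1]), int(row[2])
            if ngram == target and totals.get(year):
                yield year, match_count / totals[year]
```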
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
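A minimal sketch of walking the per-user folder structure of a local copy of the corpus with Python's standard library (the path and layout below are assumptions; adjust them to however your copy of the dataset is organized):

```python
import os
from email import message_from_string

# Hypothetical local path to an unpacked copy of the corpus,
# organized as maildir-style per-user folders.
ROOT = "enron_mail/maildir"

message_count = 0
for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        with open(os.path.join(dirpath, name), encoding="latin-1") as f:
            msg = message_from_string(f.read())  # parse headers and body
        message_count += 1

print("messages parsed:", message_count)
```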