Research Guides: Text Data Mining: Data Sources

Textual Data Archives

English-Corpora.org
The online version of English-Corpora.org comprises several corpora, including iWeb (the Intelligent Web Corpus); NOW (News on the Web); the Coronavirus Corpus; COCA (Corpus of Contemporary American English); GloWbE (Global Web-based English); the Wikipedia Corpus; COHA (Corpus of Historical American English); the TV Corpus; the Movies Corpus; and the SOAP Corpus, as well as Corpus del Español and Corpus do Português. The corpora have many uses, including finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (for example, historical, dialectal, and by genre); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. To access the corpora as a downloadable set for offline use, see the resource "English-Corpora Text-as-Data."
Users must create an account with English-Corpora.org using their NYU email address. Users must also connect using this link at least once every 365 days to retain their account's access.
Linguistic Data Consortium (LDC)
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations, and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. LDC creates and distributes a wide array of language resources, including materials used by those engaged in language-related education, research, and technology development. Spanning data collections, corpora, software, research papers, and specifications, these vital tools aid and inspire scientific progress. Please note that NYU does not currently own all products available from LDC.
To access LDC data, you must be identified as a member of the NYU Community. To get started, go to the account homepage and create an account using your NYU email. Please be sure to enter "New York University" (without the quotes) into the "Organization" box and then select it as an organization. After you make your account, you will be contacted with next steps.
HathiTrust
HathiTrust Digital Library is a digital preservation repository and highly functional access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Bibliographic and full-text search are available for all volumes in HathiTrust. Public domain volumes are freely accessible to the public and can be downloaded in their entirety, with authentication, by persons affiliated with partner institutions. NYU Libraries is a HathiTrust partner institution.

Caselaw Access Project

The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.
CAP includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.
The scope includes all state courts, federal courts, and territorial courts for American Samoa, Dakota Territory, Guam, Native American Courts, Navajo Nation, and the Northern Mariana Islands. The earliest case is from 1658, and the most recent cases are from 2018.
Each volume has been converted into structured, case-level data broken out by majority and dissenting opinion, with human-checked metadata for party names, docket number, citation, and date.
Chronicling America: Historic American Newspapers
A considerable corpus of historic newspapers, including some with near-contemporary publication dates, also accessible via a user-friendly API.
Ciphers of The Times
Agony Columns - A corpus of 650k sentences scraped from the Agony Column of The Times between 1860 and 1879. Victorian Novels - 25+ million words from a corpus of 220 Victorian novels (1800-1920).
Cultural Analytics Dataverse (McGill University)
This dataverse is a small collection of individual corpora produced or explored by the txt lab at McGill University.
Documenting the American South
Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
Google nGram texts
These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets). Each of the numbered links below will directly download a fragment of the corpus. In Version 2 the ngrams are grouped alphabetically (languages with non-Latin scripts were transliterated); in Version 1 the ngrams are partitioned into files of equal size. In addition, for each corpus we provide a file named total_counts, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of ngrams.
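As a rough illustration of the relative-frequency calculation mentioned above, the following Python sketch divides an ngram's yearly match counts by the yearly totals. It assumes the counts have already been loaded into dictionaries keyed by year; the function name and the sample numbers are hypothetical and do not come from the actual dataset or its file format.

```python
def relative_frequencies(ngram_counts, total_counts):
    """Divide each year's ngram count by that year's total 1-gram count.

    Both arguments are plain dicts keyed by year (an assumption made for
    this sketch; it does not parse the downloadable files themselves).
    Years with a missing or zero total are skipped.
    """
    return {
        year: count / total_counts[year]
        for year, count in ngram_counts.items()
        if total_counts.get(year)
    }


# Hypothetical counts, for illustration only.
ngram_counts = {1900: 5, 1901: 10}
total_counts = {1900: 1000, 1901: 4000}

for year, freq in sorted(relative_frequencies(ngram_counts, total_counts).items()):
    print(year, freq)
```

Working with relative rather than raw counts matters because the number of scanned books differs considerably from year to year.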
HathiTrust Research Center Analytics
HTRC Analytics supports large-scale computational analysis of the works in the HathiTrust Digital Library to facilitate non-profit and educational research.
Internet Archive Scholar
Requestable full-text datasets derived from the content of Internet Archive, which includes public website content, digitized print, and material in archive.org.
JSTOR Text Analysis Support
JSTOR text analysis support accommodates text analysis and digital humanities research by providing datasets of full-text for journals, books, research reports, and pamphlets on JSTOR. Content available for text analysis includes most journals and Open Access books on JSTOR, as well as most pamphlets and research reports. In addition, OCR-enabled items from Reveal Digital’s open collections are also included.
Resources for text analysis:
- Working with JSTOR Bibliographic Metadata
- Working with JSTOR Full-Text Datasets
- Select class and webinar recordings on the Constellate YouTube channel
- Notebooks and tutorials in the Constellate GitHub repository
- Past Text Analysis Pedagogy (TAP) Institute GitHub repositories and notebooks
PLOS Full Corpus of Published Articles
JATS-standard XML content of every PLOS article, including all Articles and Front Matter, available for download and use.
Project Gutenberg
Project Gutenberg is a repository of ebooks that can be downloaded as text. You will find the world's great literature here, especially older works for which copyright has expired.

Individual Corpora

Enron Emails
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

Dictionaries and Corpora for Cleaning

Data Munging resources from Ted Underwood and HathiTrust
Commonly found English-language tokens, common OCR correction lists, and other resources for cleaning and preparing a corpus for text mining.
U.S. Census 1990 frequently occurring surnames
A resource to aid named entity recognition and other corpus preparation steps.
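One way a surname list like this might support a first pass at named entity recognition is as a simple lookup table (a gazetteer). The Python sketch below flags capitalized tokens that appear in a surname list. The function name and the tiny sample list are hypothetical, and the sketch assumes the surnames have already been read into memory; it is not a substitute for a trained NER model.

```python
def flag_possible_surnames(tokens, surnames):
    """Return the tokens that start with a capital letter and match a
    known surname, ignoring case in the comparison.

    `surnames` is any iterable of surname strings (an assumption for this
    sketch; loading the Census file itself is not shown).
    """
    known = {name.upper() for name in surnames}
    return [
        token for token in tokens
        if token[:1].isupper() and token.upper() in known
    ]


# Hypothetical sample data, for illustration only.
sample_surnames = ["SMITH", "GARCIA"]
sample_tokens = ["Mr.", "Smith", "met", "Garcia", "smith"]

print(flag_possible_surnames(sample_tokens, sample_surnames))
```

Requiring an initial capital reduces false matches on common words, though it will still flag surnames that double as ordinary words at the start of a sentence.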
NLTK Corpora
Predominantly English-language resources, with some in other languages.
BYU language corpora resources for cleaning and prep
Various corpora organized by genre and source, often taken from very large written corpora such as Wikipedia.

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
