Text Data Mining
Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.
The English-Corpora.org online version comprises several corpora, including: iWeb, the Intelligent Web Corpus; NOW, News on the Web; the Coronavirus Corpus; COCA, the Corpus of Contemporary American English; GloWbE, Global Web-based English; the Wikipedia Corpus; COHA, the Corpus of Historical American English; the TV Corpus; the Movies Corpus; and the SOAP Corpus, as well as the Corpus del Español and the Corpus do Português. The corpora have many uses, including: finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (for example, across historical periods, dialects, and genres); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. To access the corpora as a downloadable set for offline use, see the resource "English-Corpora Text-as-Data."
Users must create an account with English-Corpora.org using their NYU email address. Users must also connect using this link at least once every 365 days to retain their account's access.
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations, and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. LDC creates and distributes a wide array of language resources, including materials used by those engaged in language-related education, research, and technology development. Spanning data collections, corpora, software, research papers, and specifications, these vital tools aid and inspire scientific progress. Please note that NYU does not currently own all of the products available from LDC.
To access LDC data, you must be identified as a member of the NYU Community. To get started, go to the account homepage and create an account using your NYU email. Please be sure to enter "New York University" (without the quotes) into the "Organization" box and then select it as an organization. After you make your account, you will be contacted with next steps.
HathiTrust Digital Library is a digital preservation repository and highly functional access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Bibliographic and full-text search are available for all volumes in HathiTrust. Public domain volumes are freely accessible to the public and can be downloaded in their entirety, with authentication, by persons affiliated with partner institutions. NYU Libraries is a HathiTrust partner institution.
Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets). Each of the numbered links below will directly download a fragment of the corpus. In Version 2 the ngrams are grouped alphabetically (languages with non-Latin scripts were transliterated); in Version 1 the ngrams are partitioned into files of equal size. In addition, for each corpus we provide a file named total_counts, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of ngrams.
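As a rough illustration of how the total_counts file can be used to compute relative frequencies, here is a minimal Python sketch. The file layouts are assumptions, not something stated above: total_counts is assumed to hold whitespace-separated records of the form year,match_count,page_count,volume_count, and each ngram row is assumed to be tab-separated as ngram, year, match_count, volume_count. Check the layout of the files you actually download. The function names and the sample numbers are hypothetical and exist only for this example.

```python
def parse_total_counts(text):
    """Return {year: total 1-gram count} from the text of a total_counts file.

    ASSUMED layout: whitespace-separated records, each
    'year,match_count,page_count,volume_count'.
    """
    totals = {}
    for record in text.split():
        fields = record.split(",")
        if len(fields) < 2:
            continue  # skip anything that is not a full record
        totals[int(fields[0])] = int(fields[1])
    return totals


def parse_ngram_rows(lines, target):
    """Return {year: match_count} for one ngram.

    ASSUMED layout: 'ngram<TAB>year<TAB>match_count<TAB>volume_count'.
    """
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3 and parts[0] == target:
            counts[int(parts[1])] = int(parts[2])
    return counts


def relative_frequencies(ngram_counts, totals):
    """Divide each yearly count by that year's total number of 1-grams."""
    return {
        year: count / totals[year]
        for year, count in ngram_counts.items()
        if totals.get(year)  # skip years with a missing or zero total
    }


# Made-up sample data, for illustration only (not real corpus values).
totals_text = "1900,1000000,5000,50\t1901,2000000,9000,80"
rows = [
    "data\t1900\t250\t30",
    "data\t1901\t1000\t60",
    "date\t1900\t999\t40",
]

totals = parse_total_counts(totals_text)
rel = relative_frequencies(parse_ngram_rows(rows, "data"), totals)
for year in sorted(rel):
    print(year, rel[year])
```

Dividing by the yearly total rather than comparing raw counts matters because the number of scanned books differs widely from year to year, so raw counts alone can make a word appear to rise or fall in use when only the size of the corpus has changed.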
The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.
CAP includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.
The scope includes all state courts, federal courts, and territorial courts for American Samoa, Dakota Territory, Guam, Native American Courts, Navajo Nation, and the Northern Mariana Islands. The earliest case is from 1658, and the most recent cases are from 2018.
Each volume has been converted into structured, case-level data broken out by majority and dissenting opinion, with human-checked metadata for party names, docket number, citation, and date.
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains about 0.5 million messages in total. The data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation.