Research Guides: Linguistics: Primary Materials and Data

Geographic distributions

Ethnologue: Languages of the World
Ethnologue: Languages of the World contains data for finding, reading about, and researching all 7,097 known living languages in the world.
Dictionary of American Regional English (DARE)
The Dictionary of American Regional English (DARE) represents American regional vocabulary, from Adam’s housecat to Zydeco. It is consulted by scholars and lovers of language and regional nuance. This digital version transforms the dictionary into an interactive, multimedia tool for both scholarly inquiry and general intellectual curiosity.

The World Atlas of Language Structures
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.

Enhanced Electronic Grammars Online
Enhanced Electronic Grammars (EEG) features comprehensive descriptions of languages from around the world. This online resource makes full grammars available together in an interlinked and semantically annotated format, allowing granular access to the grammatical data and enabling cross-language research across several grammars at the same time. In addition to supporting cross-linguistic queries, each grammar can also be read and researched individually. The electronic format allows for multimodal enhancement of language descriptions, such as audio and video supplements. The database is updated biannually, integrating several new grammar publications each year for even more extensive cross-linguistic research.
Linguistic Minorities in Europe Online
Linguistic Minorities in Europe Online (LME) provides comprehensive documentation of indigenous and immigrant linguistic minorities in Europe. Currently, LME contains articles on Basque, Breton, Croatian, Frisian, Hungarian, the Sámi languages, and Turkish as minority languages.

Corpora materials

These resources provide access to linguistic corpora or other materials that may be valuable for corpus-based work.

Linguistic Data Consortium (LDC)
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. LDC creates and distributes a wide array of language resources, including materials used by those engaged in language-related education, research and technology development. Spanning data collections, corpora, software, research papers and specifications, these vital tools aid and inspire scientific progress. Please note that NYU does not currently own all products available from LDC.

To access LDC data, you must be identified as a member of the NYU Community. To get started, go to the account homepage and create an account using your NYU email. Please be sure to enter "New York University" (without the quotes) into the "Organization" box and then select it as an organization. After you make your account, you will be contacted with next steps.
English-Corpora.org
The online version of English-Corpora.org comprises several corpora, including: iWeb, the Intelligent Web Corpus; NOW, News on the Web; the Coronavirus Corpus; COCA, the Corpus of Contemporary American English; GloWbE, Global Web-based English; the Wikipedia Corpus; COHA, the Corpus of Historical American English; the TV Corpus; the Movies Corpus; and the SOAP Corpus, as well as Corpus del Español and Corpus do Português. The corpora have many different uses, including: finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (e.g. historical, dialectal, and by genre); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. To access the corpora as a downloadable set for offline use, see the resource "English-Corpora Text-as-Data."

Users must create an account with English-Corpora.org using their NYU emails. Users must also connect using this link at least once every 365 days to retain their account's access.
ProQuest Historical Newspapers Text-as-Data Collection
ProQuest Historical Newspapers Text-as-Data Collection consists of machine-readable XML files containing metadata and full-text content for 26 historical newspapers spanning the eighteenth to twentieth centuries. The collection is available to NYU faculty and students.

An overview of newspapers, time coverage, and access points can be found on the NYU Libraries guide to the collection.
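
Because the collection is delivered as XML files, a first step in working with it is often simply to see which elements a file contains. The following is a minimal Python sketch of that step, not a description of the collection's actual schema: the function name `summarize_xml` and the tag names in the sample record (`Record`, `Title`, `Date`, `FullText`) are hypothetical and used only for illustration. Consult the collection's documentation for the real element names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_text):
    """Return a Counter of element tag names found in one XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag for elem in root.iter())


# Hypothetical sample record; the real files may use different tag names.
sample = (
    "<Record>"
    "<Title>Example headline</Title>"
    "<Date>1901-01-01</Date>"
    "<FullText>Some article text.</FullText>"
    "</Record>"
)

tag_counts = summarize_xml(sample)
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```

Once the actual element names are known, the same `ElementTree` calls (for example `root.find` or `root.iter` with a tag name) can be used to pull out metadata fields and full text.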