Research Guides: Text Data Mining: ProQuest Historical Newspapers TDM

ProQuest Historical Newspapers Text-as-Data Collection

The ProQuest Historical Newspapers text-as-data collection consists of approximately 99.1 million XML files capturing the full text and several metadata fields of articles from 26 newspaper titles in the ProQuest Historical Newspapers database. The titles and years of coverage in this collection are:

American Hebrew & Jewish Messenger, 1857-1922
American Israelite, 1854-1927
Atlanta Daily World, 1931-2003
Atlanta Constitution, 1868-1933
Baltimore Afro-American, 1893-1988
Boston Globe, 1872-1985
Chicago Tribune, 1849-1933
Christian Science Monitor, 1908-1994
Guardian and the Observer, 1791-1909
Hartford Courant, 1764-1931
Irish Times and Weekly Irish Times, 1859-1930
Jewish Advocate, 1905-1990
Jewish Exponent, 1887-1990
Norfolk Journal & Guide, 1921-2003
Los Angeles Times, 1881-1933
New York Amsterdam News, 1922-1993
New York Times with Index, 1851-1937
New York Tribune/Herald Tribune, 1841-1962
Newsday, 1940-1989
Philadelphia Tribune, 1912-2001
Pittsburgh Courier, 1911-2002
San Francisco Chronicle, 1865-1922
Times of India, 1838-2008
Wall Street Journal, 1889-1936
Washington Post, 1877-1935
ProQuest's Chinese Newspaper Collection, 1832-1953

Note that this collection is static: no additional titles or newer issues are being added at this time. For more complete access to the full text of ProQuest's newspaper holdings, consider using ProQuest TDM Studio.

The collection is available to members of the NYU community only.

Each XML file contains the metadata, identifiers, and full text of one article from a single newspaper issue. At both access points below, the files are organized by newspaper title; each newspaper directory holds multiple compressed archives (.zip format), and each archive contains many XML files. A README file included with the collection gives further details about the file structure.
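As a minimal sketch of working with this layout, the Python below reads the XML files inside one archive without unpacking it to disk and counts the element tags it finds. The function names iter_articles and tag_counts are illustrative, not part of the collection. Because this guide does not list the XML element names, the sketch discovers them rather than assuming them; the README remains the authoritative description of the schema.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def iter_articles(zip_path):
    """Yield (member_name, root_element) for each XML file in one archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not an XML article file.
            if not name.lower().endswith(".xml"):
                continue
            with archive.open(name) as handle:
                yield name, ET.parse(handle).getroot()


def tag_counts(zip_path, limit=100):
    """Count element tags across the first `limit` articles in an archive.

    Useful for discovering which fields are present before writing
    field-specific extraction code.
    """
    counts = Counter()
    for i, (_, root) in enumerate(iter_articles(zip_path)):
        if i >= limit:
            break
        counts.update(element.tag for element in root.iter())
    return counts
```

Calling tag_counts on a single archive gives a quick inventory of the fields available in that title before any larger job is run.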

Access Instructions

There are two access points:

Research Workspace: NYU researchers with a valid netID can mount the cloud-based ds_collections Research Workspace share on any local computer on the NYU network (i.e. on campus or on NYU VPN if off campus). Follow the instructions for how to access Research Workspace, using ds_collections as the project name. The collection will be found at ds_collections/proquest/proquest_hnp.
NYU High Performance Computing: The files can be found and used directly for batch jobs on the NYU HPC at /scratch/work/public/proquest/proquest_hnp. To request an HPC account (faculty sponsorship is required), visit the HPC homepage.
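Once either access point is mounted, a first step is often to take stock of what is there. The sketch below, written against the HPC path given above, maps each newspaper directory to the .zip archives beneath it. The helper name archives_by_title is hypothetical, and the recursive search is an assumption made so the sketch still works if archives are nested below the title directory; check the README for the actual layout.

```python
from pathlib import Path

# HPC location given in this guide; on Research Workspace, substitute the
# local mount point of ds_collections/proquest/proquest_hnp.
HNP_ROOT = Path("/scratch/work/public/proquest/proquest_hnp")


def archives_by_title(root=HNP_ROOT):
    """Map each newspaper directory name to a sorted list of its .zip paths."""
    root = Path(root)
    return {
        title_dir.name: sorted(title_dir.rglob("*.zip"))
        for title_dir in sorted(root.iterdir())
        if title_dir.is_dir()
    }


if __name__ == "__main__":
    if HNP_ROOT.exists():
        for title, archives in archives_by_title().items():
            print(f"{title}: {len(archives)} archives")
    else:
        print(f"{HNP_ROOT} not found; is the share mounted?")
```

The resulting dictionary can be used to split work by title, for example one batch job per newspaper.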

About the data

You can preview the data quality and availability via the dashboard embedded in this guide. OCR quality was scored per article as the percentage of the article's tokens that appear in a large curated list of attested words drawn from the HathiTrust Digital Library.

Additional documentation related to this collection is available in the guide prepared by ProQuest.
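To make the scoring idea concrete, here is a rough sketch of a token-in-vocabulary percentage. It is not the procedure used to produce the dashboard scores: the tokenizer (runs of letters, lowercased), the function name ocr_quality, and the vocabulary argument are all assumptions for illustration, and the real scores relied on the HathiTrust-derived word list.

```python
import re

# Assumed tokenizer: runs of Unicode letters (no digits or underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def ocr_quality(text, vocabulary):
    """Return the percentage of tokens in `text` found in `vocabulary`.

    `vocabulary` is any set of lowercase attested words. An article with
    no tokens scores 0.0.
    """
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return 100.0 * hits / len(tokens)
```

A score computed this way can serve as a filter, for example keeping only articles above a chosen threshold before further analysis.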

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
