
Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

ProQuest Vogue Magazine Text-as-Data

The ProQuest Vogue Magazine text-as-data collection consists of approximately 451,000 XML files with accompanying JPEGs, capturing the full text, page images, and a number of metadata fields for Vogue Magazine from 1892 to 2016. The collection is static: newer issues are not being added at this time. For more complete access to the full text of ProQuest's holdings, including more recent issues, consider using ProQuest TDM Studio.

The collection is available to members of the NYU community only.

Each XML file contains the metadata, identifiers, and full text of one article within a magazine issue. Each JPEG file contains one page image. The XML files are arranged by year of publication and zipped in year-by-year groups. JPEGs are further organized by month and issue. A README file included with the collection gives further details about the file structure.
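As a starting point for working with the year-by-year zip files, the following Python sketch reads every XML file in one archive without unzipping it to disk. It assumes nothing about the ProQuest schema: instead of naming specific fields, it counts the element names it finds, so you can discover the structure and compare it with the README. The function names and the sample record in the demo (including the element names Record, Title, and Text) are hypothetical and chosen only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def iter_articles(zip_source):
    """Yield (member_name, root_element) for each XML file in a zip archive.

    zip_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories, READMEs, and other non-XML members
            with archive.open(name) as handle:
                yield name, ET.parse(handle).getroot()


def tag_counts(root):
    """Count element names in one article, to help discover the schema."""
    return Counter(element.tag for element in root.iter())


def all_text(root):
    """Join every text node in the article into one whitespace-normalized string."""
    return " ".join(" ".join(root.itertext()).split())


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without the collection.
    # The element names below are made up; the real ones are in the XML files.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as archive:
        archive.writestr(
            "sample/article.xml",
            "<Record><Title>Sample</Title><Text>Body of the article.</Text></Record>",
        )
    demo.seek(0)

    for name, root in iter_articles(demo):
        print(name, dict(tag_counts(root)))
        print(all_text(root))
```

To run this against the collection, pass the path of one of the yearly zip files to iter_articles in place of the in-memory demo archive. The actual archive names are not listed here, so check the README or the directory listing first.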

To access the collection, use Research Workspace. NYU researchers with a valid NetID can mount the cloud-based ds_collections Research Workspace share on any local computer connected to the NYU network (that is, on campus, or through the NYU VPN when off campus). Follow the instructions for accessing Research Workspace, using ds_collections as the project name. The collection is located at ds_collections/proquest/proquest_vogue.

Questions?

Contact data.services@nyu.edu for questions about this data source.