
Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

Avocado Research Email Collection

The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian", although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.
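As a rough illustration of the layout described above, the following Python sketch reads the item texts from one per-custodian zip file and tallies the element names in one XML metadata file. It is only a starting point under stated assumptions: the function names are hypothetical, the encoding of the text files is assumed to be UTF-8 (the description states this only for the XML metadata), and no element or file names from the actual collection are assumed. It also reads a single XML file on its own and does not resolve any directives that link the top-level file to the per-custodian files.

```python
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET


def iter_item_texts(zip_path, encoding="utf-8"):
    """Yield (member_name, text) for each file in a per-custodian zip.

    The UTF-8 default is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            raw = archive.read(info)
            yield info.filename, raw.decode(encoding, errors="replace")


def count_tags(xml_path):
    """Count element tags in one XML metadata file to explore its schema."""
    counts = Counter()
    # iterparse streams the file, so large metadata files need not fit in memory
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # release the element's children once counted
    return counts
```

Calling count_tags on a per-custodian XML file shows which element names actually occur before any schema-specific parsing is written, and iter_item_texts can then be looped over that custodian's zip file to pair each extracted text with its metadata entry.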

Access Instructions

Access Restrictions

Different people at NYU have different levels of access to the AREC data. Refer to the chart to see what you have access to.

Table of permissions for AREC data access at NYU.
Primary status at NYU Full raw data
Undergraduate student
Undergraduate student - completing a capstone or thesis
Graduate student
Staff
Faculty
Visiting scholar/scholar-in-residence
Non-NYU collaborator

Access Instructions

To get access to the AREC data, please follow these steps:

  1. Write a data management plan using our template.
  2. Agree to the terms of use for AREC data. 
  3. Wait for a reply from a member of Data Services with further instructions on how to access the data.