
Research Data Management

Information on best practices and standards for data and code management.

FILE FORMAT SELECTION

Ideally, the file formats used in a project should be standard, non-proprietary, and open. If that is not possible, formats should at the very least be chosen with sustainability and long-term use in mind. Try opening a Windows 95 Word document on a modern computer and you'll understand why (hint: you will get only Wingdings)!

Much of the software you will have to use relies on proprietary file formats, which often stop being readable as new versions are released or as the tools lose relevance. Where possible, export data files to stable formats for long-term access to your data, or convert proprietary files into equivalent standardized files that can represent the same data (like going from .xlsx to .csv).
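As a rough illustration of the .xlsx to .csv conversion mentioned above, here is a minimal Python sketch. It assumes the third-party openpyxl package is installed for reading the workbook; the function names `write_rows_as_csv` and `xlsx_to_csv` are hypothetical, chosen only for this example. Only the active worksheet is exported, and formulas, formatting, and additional sheets are not carried over, so check the result against the original before relying on it.

```python
import csv


def write_rows_as_csv(rows, csv_path):
    """Write an iterable of rows (sequences of cell values) to a UTF-8 CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # The csv module writes None (an empty cell) as an empty string.
        csv.writer(handle).writerows(rows)


def xlsx_to_csv(xlsx_path, csv_path):
    """Export the active worksheet of an .xlsx workbook to CSV.

    Assumption: the third-party openpyxl package is available.
    """
    from openpyxl import load_workbook  # imported here so the CSV helper works without it

    # data_only=True reads the last cached result of each formula, not the formula text.
    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    try:
        rows = workbook.active.iter_rows(values_only=True)
        write_rows_as_csv(rows, csv_path)
    finally:
        workbook.close()
```

A call such as `xlsx_to_csv("survey.xlsx", "survey.csv")` (file names are placeholders) would write the first sheet as plain comma-separated text that any spreadsheet program or text editor can open.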

A proprietary format can refer to:

  • a file format whose data is ordered and stored according to a particular encoding scheme that its authors designed to be secret
    • The secrecy means that specific hardware and software (designed and sold by the authors) can interpret the format better than others (for example, a .psd file opens more seamlessly in Photoshop than in Glimpse)
  • a file format that is openly documented but whose use is restricted through licenses

An open format is:

  • "An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information." -- Open Government Directive
  • a file format defined by a published specification maintained by a standards organization, and which has no restrictions on its usage (e.g. not restricted by copyright)
    • By design, there are no restrictions on the type of software or hardware that can use these formats (for example, a .csv can be opened in Google Sheets, Excel, and LibreOffice Calc)

Examples of sustainable formats

Long-term formats for data

Text

  • XML (.xml)

  • HTML (.html, .htm)

  • OpenDocument Format (e.g. OpenDocument Text, .odt)

  • Plain text (.txt)

  • Markdown and other human-readable markup languages deploying plain-text editing

Tabular

Media

  • Uncompressed TIFF (.tif)

  • JPEG 2000 (.jp2 for still images, .mj2 for Motion JPEG 2000 video)

  • MPEG-4 (.mp4)

  • Free Lossless Audio Codec (.flac)

Geospatial

  • ESRI Shapefiles and supporting files (.shp, .shx, .dbf, .prj, .sbx, .sbn)

  • KML (.kml)

  • GML (.gml)

  • GeoTIFF (.tif, .tfw)

Mid-term formats for data

Text

  • PDF/A

Statistical

  • SPSS portable format (.por)

  • R file formats, i.e. script files (.R), data files (.Rda, .RData), or R Markdown files (.Rmd)

  • Stata file formats, i.e. do-files (.do) and data files (.dta)

  • SAS file formats (.sas, .xpt, etc.)

Media

  • JPEG (.jpeg, .jpg)

  • MP3 (.mp3)

  • Photoshop files (.psd)
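The long-term and mid-term lists above can be turned into a quick inventory of a project folder. The following Python sketch is only an illustration: the extension sets are transcribed from the lists in this guide, with .html, .jp2, and .md added as assumptions about common spellings, and the names `classify` and `inventory` are hypothetical. PDF/A is left out because the .pdf extension alone cannot distinguish it from ordinary PDF.

```python
from pathlib import Path

# Extensions taken from the "long-term" lists in this guide.
# Assumption: .html, .jp2, and .md are added as common spellings of listed formats.
LONG_TERM = {
    ".xml", ".htm", ".html", ".odt", ".txt", ".md",
    ".tif", ".jp2", ".mj2", ".mp4", ".flac",
    ".shp", ".shx", ".dbf", ".prj", ".sbx", ".sbn", ".kml", ".gml", ".tfw",
}

# Extensions taken from the "mid-term" lists in this guide (lowercased).
MID_TERM = {
    ".por", ".r", ".rda", ".rdata", ".rmd", ".do", ".dta", ".sas", ".xpt",
    ".jpeg", ".jpg", ".mp3", ".psd",
}


def classify(filename):
    """Return 'long-term', 'mid-term', or 'unlisted' based on the file extension."""
    extension = Path(filename).suffix.lower()
    if extension in LONG_TERM:
        return "long-term"
    if extension in MID_TERM:
        return "mid-term"
    return "unlisted"


def inventory(folder):
    """Group every file under `folder` (recursively) by its classification."""
    report = {"long-term": [], "mid-term": [], "unlisted": []}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            report[classify(path.name)].append(str(path))
    return report
```

Files reported as "unlisted" are not necessarily a problem; they are simply the ones worth checking by hand for a more sustainable export option.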

Geospatial

Encoding

Where the file format allows it, text should be encoded in Unicode (UTF-8 or UTF-16). The older ASCII encoding is also acceptable: ASCII has been incorporated into Unicode, so a pure ASCII file is already valid UTF-8.
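For text files saved in an older encoding, re-encoding to UTF-8 is usually a short step. The Python sketch below assumes the source encoding is known (Windows-1252, `cp1252`, is used as a placeholder default); the function name `reencode_to_utf8` is hypothetical. If the source encoding is guessed wrongly, characters may be silently misread, so spot-check accented or non-Latin text afterwards.

```python
def reencode_to_utf8(source_path, target_path, source_encoding="cp1252"):
    """Read a text file in a known legacy encoding and save a UTF-8 copy.

    Assumption: `source_encoding` is the file's real encoding; it is not detected.
    """
    # newline="" keeps the original line endings unchanged in both directions.
    with open(source_path, "r", encoding=source_encoding, newline="") as source:
        text = source.read()
    with open(target_path, "w", encoding="utf-8", newline="") as target:
        target.write(text)
    # str.isascii() is True when the text would be identical in ASCII and UTF-8.
    return text.isascii()
```

The return value reports whether the file was plain ASCII all along, in which case the UTF-8 copy is byte-for-byte identical to the original.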

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.