Skip to Main Content

Research Data Management

Information on best practices and standards for data and code management.

OPENREFINE

OpenRefine, formerly Google Refine and before that Freebase Gridworks, is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it. The main use of OpenRefine is data processing and transformation to other formats. What’s more, is that all actions that were done on a dataset are stored in a project and can be replayed on another dataset!

Why Use OpenRefine?

  • Simple installation
  • Extensive documentation
  • Lots of great import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
  • Upload from local drive or import from URL
  • Many export formats: TSV, CSV, Excel, HTML table
  • Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
  • Data remains on your computer, so nothing is shared until you choose to share it.
  • Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and many more (listed here, scroll to ‘extensions’)!
  • Active development community

FACETS WITH OPENREFINE

One of the most powerful operations that OpenRefine has to offer are facets. When you look at facets for a given column, it shows all unique entries with frequencies. You can use that to get a feel for how consistent your data is. You can also use facets to subset rows that you want to change in bulk. The facet information always appears in the left hand panel in the OpenRefine interface. There are:

  • Numeric facets
  • Timeline facets (for dates)
  • Custom facets
  • Scatterplot facets

Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.

Here is an example of a text facet on a column of university names -- note OpenRefine lets us merge these duplicates it found into one:

OpenRefine screenshot of clustering

GOOGLE REFINE EXPRESSION LANGUAGE

GREL stands for the Google Refine Expression Language, and it’s a way we can automate changes in OpenRefine. You can use GREL to query APIs, change data formats, split columns, and a whole lot more. OpenRefine lets you choose between GREL, Python or Jython (an implementation of python designed to run on the Java platform), or Clojure (dialect of the Lisp programming language).

You can use GREL to mass-process data, like in this example where I needed to replace URL elements with what they represent (like %20 = space) in this column of university names:

Adding a column based on a column in OpenRefine with GREL

You can use regular expressions in GREL to powerfully repurpose and redefine your data! A regular expression, regex, is a sequence of characters that define a search pattern.  You can even use GREL to call Google Maps API to get lat/longs for datasets where you have addresses. The possibilities with GREL are endless!

CC

Creative Commons License
Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.