Research Guides: Research Data Management: OpenRefine

OPENREFINE

Open Refine text underneath a blue diamond, forming the OpenRefine logo OpenRefine, formerly Google Refine and before that Freebase Gridworks, is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it. The main use of OpenRefine is data processing and transformation to other formats. What’s more, is that all actions that were done on a dataset are stored in a project and can be replayed on another dataset!

Why Use OpenRefine?

Simple installation
Extensive documentation
Lots of great import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
Upload from local drive or import from URL
Many export formats: TSV, CSV, Excel, HTML table
Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
Data remains on your computer, so nothing is shared until you choose to share it.
Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and many more (listed here, scroll to ‘extensions’)!
Active development community

FACETS WITH OPENREFINE

One of the most powerful operations that OpenRefine has to offer are facets. When you look at facets for a given column, it shows all unique entries with frequencies. You can use that to get a feel for how consistent your data is. You can also use facets to subset rows that you want to change in bulk. The facet information always appears in the left hand panel in the OpenRefine interface. There are:

Numeric facets
Timeline facets (for dates)
Custom facets
Scatterplot facets

Some of the default custom facets are:

Word facet - this breaks down text into words and counts the number of records each word appears in
Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.

Here is an example of a text facet on a column of university names -- note OpenRefine lets us merge these duplicates it found into one:

GOOGLE REFINE EXPRESSION LANGUAGE

GREL stands for the Google Refine Expression Language, and it’s a way we can automate changes in OpenRefine. You can use GREL to query APIs, change data formats, split columns, and a whole lot more. OpenRefine lets you choose between GREL, Python or Jython (an implementation of python designed to run on the Java platform), or Clojure (dialect of the Lisp programming language).

You can use GREL to mass-process data, like in this example where I needed to replace URL elements with what they represent (like %20 = space) in this column of university names:

You can use regular expressions in GREL to powerfully repurpose and redefine your data! A regular expression, regex, is a sequence of characters that define a search pattern. You can even use GREL to call Google Maps API to get lat/longs for datasets where you have addresses. The possibilities with GREL are endless!

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Research Data Management

General Information

Service Desk and Chat

RELEVANT SLIDES

OPENREFINE

FACETS WITH OPENREFINE

GOOGLE REFINE EXPRESSION LANGUAGE

CC