OpenRefine, formerly Google Refine and before that Freebase Gridworks, is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it. Its main use is cleaning data and transforming it into other formats. What’s more, every action performed on a dataset is stored in the project and can be replayed on another dataset!
Why Use OpenRefine?
Facets are one of the most powerful features that OpenRefine has to offer. A facet on a given column shows all of that column's unique entries along with their frequencies. You can use that to get a feel for how consistent your data is. You can also use facets to select a subset of rows that you want to change in bulk. The facet information always appears in the left-hand panel of the OpenRefine interface. There are:
Some of the default custom facets are:
Here is an example of a text facet on a column of university names. Note that OpenRefine lets us merge the duplicates it found into one:
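To make the idea concrete outside the OpenRefine interface, here is a minimal Python sketch of what a text facet reports (each distinct value with its count) and of how near-duplicate spellings might be grouped for merging. The sample names and the helper functions `text_facet`, `simple_key`, and `group_variants` are hypothetical, and the normalization key is only a simplified stand-in, not OpenRefine's actual clustering code.

```python
from collections import Counter, defaultdict
import string

# Hypothetical column values, made up for illustration.
names = [
    "New York University",
    "new york university",
    "New York University ",
    "Boston University",
    "University, Boston",
]

def text_facet(values):
    """Return (value, count) pairs, most frequent first, like a text facet."""
    return Counter(values).most_common()

def simple_key(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    no_punct = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(no_punct)
    return " ".join(sorted(set(cleaned.split())))

def group_variants(values):
    """Group values that share a key; keep only groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[simple_key(value)].add(value)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}

if __name__ == "__main__":
    for value, count in text_facet(names):
        print(repr(value), count)
    for key, variants in group_variants(names).items():
        print(key, "->", variants)
```

The facet treats every spelling as a separate entry, which is exactly what makes inconsistencies visible. The grouping step then suggests which entries are probably the same institution.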
GREL stands for the General Refine Expression Language (originally the Google Refine Expression Language), and it’s a way we can automate changes in OpenRefine. You can use GREL to query APIs, change data formats, split columns, and a whole lot more. For writing expressions, OpenRefine lets you choose between GREL, Python via Jython (an implementation of Python designed to run on the Java platform), or Clojure (a dialect of the Lisp programming language).
You can use GREL to mass-process data, like in this example where I needed to replace URL-encoded characters with what they represent (like %20 = space) in this column of university names:
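The original expression is not reproduced here, so as a stand-in, the following Python sketch shows the same kind of transformation: decoding percent-encoded characters across a whole column of values. The sample names are hypothetical, and `urllib.parse.unquote` is Python's standard-library decoder rather than anything specific to OpenRefine.

```python
from urllib.parse import unquote

# Hypothetical URL-encoded university names, made up for illustration.
encoded_names = [
    "University%20of%20Michigan",
    "Texas%20A%26M%20University",
]

# Decode every percent-escape (%20 -> space, %26 -> &) in each value.
decoded_names = [unquote(name) for name in encoded_names]

if __name__ == "__main__":
    for before, after in zip(encoded_names, decoded_names):
        print(before, "->", after)
```

Decoding with a general-purpose function covers every escape at once, rather than replacing `%20` and each other code by hand.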
You can use regular expressions in GREL to powerfully repurpose and redefine your data! A regular expression, or regex, is a sequence of characters that defines a search pattern. You can even use GREL to call the Google Maps API to get latitudes and longitudes for datasets where you have addresses. The possibilities with GREL are endless!
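As a rough illustration of a regex-driven cleanup, here is a short Python sketch (Python rather than GREL, so the syntax differs from what you would type in OpenRefine). The function `tidy_name` and the abbreviation rule are assumptions chosen for the example: it collapses runs of whitespace and expands a standalone "Univ" or "Univ." to "University".

```python
import re

def tidy_name(value):
    """Normalize spacing and expand the abbreviation 'Univ' / 'Univ.'."""
    # Collapse any run of whitespace into a single space and trim the ends.
    value = re.sub(r"\s+", " ", value).strip()
    # Expand "Univ" only when it is a whole word, with an optional trailing dot.
    value = re.sub(r"\bUniv\b\.?", "University", value)
    return value

if __name__ == "__main__":
    # Hypothetical messy values, made up for illustration.
    for raw in ["  Univ.  of   Texas ", "Boston Univ", "University of Iowa"]:
        print(repr(raw), "->", repr(tidy_name(raw)))
```

The word-boundary anchors (`\b`) keep the pattern from touching names that already spell out "University".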