Cleaning Data with OpenRefine

About the Data

In this demo we will clean a public data set of museum metadata published by the Powerhouse Museum in Sydney, Australia. This sample dataset was previously available directly from the Powerhouse Museum site, but the museum has since made the collection accessible via an API. The sample data reflects a snapshot of the collection at a specific point in time. Because we are trying to demonstrate many features, some of the steps will be very “destructive” to the data, but our original data remains safe.

Download phm-collection.tsv

The data fields are

Creating a project

Projects can be created from data on your computer, data on the web, text pasted from the clipboard, a database connection, or a linked Google Sheet. As a reminder, OpenRefine never changes your original data, and information is not sent over the Internet!

OpenRefine can accept a variety of file formats, including CSV/TSV/*SV, line-based text files, fixed-width field text files, JSON files, RDF files, XML files, and Excel files. I have found that simple *SV files tend to work the best.

OpenRefine's data preview page will attempt to identify the file type, and the available options will vary based on that file type. For our museum data, OpenRefine correctly identified the file as TSV.

[Screenshots: OpenRefine data preview; OpenRefine menu select]
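
To get a feel for what identifying a delimited file involves, here is a minimal Python sketch that guesses the delimiter of a small text sample and then parses it. This is not how OpenRefine detects file types internally; it is only an illustration. The sample rows and column names are made up (they are not the real fields of phm-collection.tsv), and guess_delimiter and parse are hypothetical helper names.

```python
import csv
import io

# Made-up sample rows for illustration only; these are NOT the real columns
# or values of phm-collection.tsv.
SAMPLE = (
    "id\ttitle\tcategory\n"
    "1\tTeapot\tCeramics\n"
    "2\tSteam engine\tMachinery\n"
)

# Delimiters we are willing to consider (an assumption of this sketch).
CANDIDATES = ["\t", ",", ";", "|"]


def guess_delimiter(text, candidates=CANDIDATES):
    """Return the candidate that occurs the same nonzero number of times
    on every non-empty line, preferring the most frequent one, or None."""
    lines = [line for line in text.splitlines() if line]
    best, best_count = None, 0
    for delim in candidates:
        # A consistent delimiter yields exactly one distinct per-line count.
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


def parse(text):
    """Parse delimited text into a list of rows using the guessed delimiter."""
    delim = guess_delimiter(text)
    if delim is None:
        raise ValueError("could not guess a delimiter")
    return list(csv.reader(io.StringIO(text), delimiter=delim))


rows = parse(SAMPLE)
print(repr(guess_delimiter(SAMPLE)))
print(rows[0])
```

The per-line counting is deliberately naive: a quoted field that contains the delimiter would throw it off. That is one reason it is worth checking the preview page before creating the project.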

Text Filters

Facets

Editing Data

Transforming Data

Clustering

Creating New Columns

Splitting Columns

Removing Duplicate Rows

Working with Different Data Types

Interacting with Rows

Undo / Redo

Bringing It All Together

Automating Workflows

Exporting a Project or Data Sets

Fetching Data from a URL