Demo: Digital Mitford Data
In this demo we are going to play with a data set extracted from Digital Mitford. The original data was a TEI encoded index of names used on the site. The
Download histperson.tsv
The data fields are
- xml id
- gender
- name
- occupation
- viaf id
Navigating OpenRefine
Creating a project
- check character encoding, options
- Refine never over writes your original data, it creates a copy!
- information is not sent over internet
Manipulating columns
- renaming and removing columns
- changing column order
- collapsing columns
Exporting a project or a data set
- OpenRefine project
- many formats!
- templating
- export is always a new copy of data, never alters original!
Automating tasks
- Undo/Redo copy
Extractto txt file (use text editor, not Word) - create new project with original file
- Undo/Redo paste saved extract into
Apply
Exploring and Cleaning Data
Cleaning the simple stuff
- a lot of options in the drop-down menus
- get rid of white space
- one column at a time:
Edit cells>Common transforms - entire sheet:
All>Transform> value.trim() and value.replace(/\s+/,’ ‘)
- one column at a time:
Edit cells>Transform...is very powerful- GREL = General Refine Expression Language
- GREL documentation and recipes are available on the OpenRefine wiki.
Sample GREL Recipes
- Remove duplicate comma separated entries in a cell
- value.split(“, “).uniques().join(“, “)
- Replace string in cells
- value.replace(“+”, “”)
- value.replace(“~””, “”).replace(“,”, “”).replace(“-“, “”)
- Clean-up character encoding problems
- value.unescape(“url”)
- Convert number with text to number
- toNumber(value.replace(“ million”, “”))*1000000
Splitting, faceting, and clustering
- multi-valued fields can be a barrier to data cleaning
Edit cells>Split multi-valued cells...- record view vs. row view
Facet>Text facet- manual cleaning and clustering
Edit cells>Join multi-valued cells
Enhancing with Data from Other Sources
Reconciling from other data sources
- Vocabulary reconciliation is a process where automated systems use terms from unstandardized metadata to search controlled vocabularies and return URIs.
- OpenRefine has built in tools to reconcile data with Wikidata
- Other data services can be added
OpenRefine’s Wikidata Service
- Reconciling the names
Reconcile>Start reconciling...- choose
Wikidata Reconciliation for OpenRefine (en) - choose
humanandAuto-match candidates with high confidence - matches some automatically, but often requires some manual review
- OpenRefine 2.8 added querying and extracting tools
- Select
matchedfrom the judgement facet Edit column>Add columns from reconciled values- Add country of citizenship, occupation, place of birth, place of death, place of burial, and VIAF ID
Add Propertyfilter to include date of birth and date of death
- Select
- Review the results of your reconciliation
- date of birth >
Facet>Timeline facet
- date of birth >
- Extract the Wikidata id
Edit column>Add column based on this column...- name column: wikidata_id
- cell.recon.match.id
Adding more data based on extracted dataset
- Geographic Coordinates for places
- place of birth >
Edit column>Add columns from reconciled values - Add coordinate location
- place of birth >
Compare the collected and extracted VIAF ids
- Clean up collected VIAF ids
Facet>Customized facets>Facet by blank- On rows view, choose
falseto select rows with viaf ids Filter- regex: ^(?!http://).+- Transform cells to remove final “/” temporarily - value.replace(/\/$/, “”)
- Transform cells to apply final “/” because that is what VIAF expects - value+”/”
- Make extracted VIAF id numbers into VIAF URLs
Edit column>Add column based on this column...- name column: viaf_url
- “http://http://viaf.org/viaf/” + value + “/”
- Build a facet to compare the two columns
- value == cells[“viaf id”].value
Other data services
- Users can set up their own data services or use other existing data services.
- Some sample services are available at the OpenRefine Wiki - Reconcilable Data Sources and at http://refine.codefork.com/.
- Reconciliation can be taxing on host servers and data sources. Documentation for hosting your own service are available at and on the conciliator GitHub.
Exporting your cleaned and expanded dataset
- Export project
- TSV, CSV, HTML table, Excel (two flavors), ODF spreadsheet
- Templating - Create XML exports!