OpenART Technology Choices

During the course of the OpenART project we’ve come across a number of different technologies and standards that we could use. In this post we go through the ones we’ve used and found useful.

We aim to answer the following series of questions:

What technologies, frameworks, standards are you using in your project? What are your impressions of them? What difficulties have you encountered? What approaches and techniques have worked well? What advice would you give to others engaging with them?

“What technologies, frameworks, standards are you using in your project?”

Data preparation and mapping/triplification/lifting

The source information was supplied as a set of Excel spreadsheets containing semi-structured data. These spreadsheets were provided by the researcher and were the “live” data capture mechanism for the transcribed source data. As such they were subject to structural changes during the lifetime of the project. Various tools and approaches for manipulation of this source data were explored.

  • Relational Database (RDBMS)
    • Early in the project an approach of migrating the data to an RDBMS was explored
    • This resulted in a cleaner version of the source information, but the approach was abandoned as the spreadsheets were a live tool
    • The RDBMS ontology mapping plugin for Protege was explored for transfer of RDBMS data to RDF/OWL
  • Excel Spreadsheet manipulation
    • Google Refine was used for very quickly visualising and manipulating source information
    • RDF Extension for Google Refine was used for exporting cleaned-up data held in Refine to test versions of RDF linked data, and for matching nodes to validate and connect parts of the data (via the reconciliation feature)
    • GNU SED (Stream Editor) is a powerful general-purpose text manipulation tool. It hits the cleaning spots the developers can’t reach (quickly) in the visual Refine environment (a small Python sketch of this kind of cleanup follows the list)
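
As an illustration of the kind of mechanical cleanup that sed handled for us, here is a minimal Python sketch that normalises whitespace in a CSV export of one of the spreadsheets. The file names and rules are hypothetical examples only; the project itself used sed and Refine rather than a script like this.

```python
# Minimal whitespace cleanup for a spreadsheet exported to CSV.
# File names are illustrative; the project applied its real rules with sed/Refine.
import csv
import re

with open("ledger-export.csv", newline="", encoding="utf-8") as src, \
     open("ledger-clean.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        cleaned = []
        for cell in row:
            cell = cell.replace("\u00a0", " ")   # non-breaking spaces from Excel
            cell = re.sub(r"\s+", " ", cell)     # collapse runs of whitespace
            cleaned.append(cell.strip())         # drop leading/trailing spaces
        writer.writerow(cleaned)
```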

Ontology development

  • Protege version 4.1 was used for ontology development. Protege is a powerful and fully featured ontology IDE (integrated development environment)

Linked Data manipulation and production

  • Rapper (part of the Raptor RDF Parser toolkit) was used for translating between various RDF serialisations. Rapper is a parsing and serialising utility for RDF.
  • SPARQL (a standard query language for RDF) for querying and checking data (see the sketch after this list)
    • Rasqal via the online service at Triplr. Rasqal provides RDF querying capabilities, including SPARQL queries
    • ARQ, a SPARQL query engine for Jena
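
The same two jobs, converting between serialisations (what rapper did for us) and running SPARQL sanity checks (what Rasqal and ARQ did), can be sketched in Python with rdflib. rdflib was not part of our toolchain, and the file names and query below are illustrative only.

```python
# Sketch only: the project used rapper and ARQ; rdflib is shown here as a
# self-contained equivalent. File names are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("openart-sample.rdf", format="xml")                     # read RDF/XML
g.serialize(destination="openart-sample.ttl", format="turtle")  # write Turtle

# A simple sanity check: list a few resources that have no rdfs:label
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?s WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { ?s rdfs:label ?label }
} LIMIT 10
"""
for row in g.query(query):
    print(row.s)
```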

Hosting and base systems

  • openSUSE Linux, a major Linux distribution, was used as a base operating system
  • SuseStudio was used to build and deploy bespoke virtual machines to Amazon EC2. SuseStudio allows you to choose packages to build the virtual machine and will directly deploy to an Amazon web service account all within a web-based interface
  • Amazon Web Services (AWS) was used for hosting, particularly the Elastic Compute Cloud (EC2) service for virtualisation
  • The Apache web server with mod_rewrite was used for web hosting of ontologies and data, with content negotiation (the negotiation decision is sketched after this list)
  • OntologyBrowser was used for storing, viewing, manipulating and accessing the ontology, using a web front end and REST API
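
The content negotiation itself was configured with Apache rewrite rules. The decision those rules implement is small enough to sketch in Python; the paths and 303-redirect pattern below are illustrative, not our actual configuration.

```python
# Sketch of the content-negotiation decision behind the Apache rewrite rules.
# Paths are hypothetical; the real setup lives in Apache config, not Python.
def negotiate(accept_header):
    """Pick a redirect target for a request to the ontology URI."""
    if "text/html" in accept_header:
        # Browsers are sent to the human-readable documentation page
        return 303, "/doc/openart.html"
    # RDF-aware clients (e.g. Accept: application/rdf+xml) get the OWL file
    return 303, "/data/openart.owl"

print(negotiate("text/html,application/xhtml+xml"))  # (303, '/doc/openart.html')
print(negotiate("application/rdf+xml"))              # (303, '/data/openart.owl')
```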

“What difficulties have you encountered?”

Source data manipulation and conversion

  • The source information was being actively worked upon during the lifetime of the project. Working on the ontology and on triplification to RDF concurrently with changes in the source information made decision-making and coordination difficult
  • The source information was provided as Excel spreadsheets. These were used as an organisational environment for capturing data by the researcher, and provided free text search, categorisation and a basic naming scheme for entities. Although a degree of rigour was applied, the environment could not be described as “structured data” such as that provided by a database, and this presented challenges in interpreting and modelling the information.
  • Originally the project assumed that hosting of the information as part of the Court, Country, City project would provide a structured source for the information; however the requirements from this project were sufficiently divergent from the OpenART project’s requirements for this not to be the case. As a result the project did not consider alternative automated processes to assist in source data manipulation and triplification until late in the project.

Ontology development tooling

  • Making bulk changes to the ontology in Protege was difficult to do efficiently
  • Synchronising new versions of the ontology between environments was a challenge

Ontology development informational issues

  • Difficulties in extending the combined ontologies – particularly when trying to move from earlier versions expressing “simplifications” to a more complex framework later in the project. It would have been easier to start with a more complex framework initially rather than trying to extend the earlier version.
  • Naming conventions became very cumbersome, for example when trying to create inverse properties that sometimes had no natural or simple English language fit

“What are your impressions of them?”

  • Protege and the RDF tools used locally were found to be very mature. Protege is now mature enough to be usable by newcomers, and also provides a route to further development as it is based on the OWL API. However, it was difficult trying to reason with anything other than a small set of data points.
  • Google Refine was surprisingly efficient and workable. Additional functionality such as the reuse of RDF mapping templates was useful.
  • Cloud services (Amazon EC2) were relatively easy to use; and the combination of openSUSE and SuseStudio provided an efficient and repeatable mechanism for deploying virtual machines
  • OntologyBrowser presents a nice interface for ontology browsing in HTML. Its user-friendly syntax for ontology axioms means that you don’t need to be an OWL or logic expert. However it is very “data-centric” and not an end-user environment.

“What approaches and techniques have worked well?”

  • Google Refine is good for rapid development when a non-trivial ontology means that the mappings and the ontology need to be developed in tandem
  • Use of Protege (see above)
  • Use of cloud services (see above)

“What advice would you give to others engaging with them?”

  • Investigate using a collaborative ontology environment, such as knoodl
  • Investing a small amount of time in collaborative environments can lead to a big payoff; cloud services are relatively easy to use
  • Start with an ontology framework complex enough to represent the modelling required; it can be difficult moving from a simple to a more complex version
Author: Martin Dow

OpenART – Some Wins and Fails

Win! Researcher on hand to explain the data and answer questions.

Win! Openness to being open with the data.

Win! Increased understanding of open data and ontologies for the domain.

Win! Ready-made expertise on the team.

Win! Google Refine as a quick way of experimenting with spreadsheet data and getting RDF out of spreadsheets.

Win! Indexing in SINDICE should be a quick win.

Fail! The ontology took longer to create than we anticipated.

Fail! The data is complex and still a work in progress, the spreadsheets memory-hungry and in need of some cleanup and post-processing.

Fail! Lots of re-visiting and round-tripping slows things down.

Fail! There is a gap between the precision needed for an ontology and the working spreadsheets of a researcher.


OpenART – Costs and Benefits


One of the blog posts required for the OpenART project is around ‘costs and benefits’: “This should be a very rough estimation of how much it has cost you in terms of time and resources to make your data openly available. What do you expect the benefits to be? And how do these 2 assessments balance out against each other?”.

Ontology Development

It’s taken around 10 days to develop the ontology to its current state. Bear in mind that this has been created by someone with a high level of knowledge and expertise in ontologies, related technologies and tooling, someone who already knows of existing relevant ontologies and could rapidly prototype an approach. A quicker and simpler approach would have been to ‘mix and match’ by selecting properties from a range of existing ontologies, without going so far as to develop a dedicated ontology. A longer and more complex approach would have been to experiment with different ontology approaches to establish the best and richest solution, e.g. descriptions and situations. Given the time available we opted for the middle ground, but this has to some extent delayed other work and has made for a more complex process of ‘understanding’.

Data Analysis and Manipulation

Analysis of the spreadsheets, importing and exporting the data, building web views of the data and data cleanup, I’d put at 25 days.

Generating the Data

Generating RDF instance data from the spreadsheets would, post-implementation, take around 10 days to set up and produce a sub-set of the data. Longer if we want to process all of the data, which we aren’t doing right now. A minimal sketch of what this step involves follows.
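
To make the shape of this step concrete, here is a minimal Python/rdflib sketch that turns rows of a cleaned CSV export into RDF instance data. The column names, namespace and properties are hypothetical placeholders rather than the OpenART ontology’s actual terms, and our own generation route goes through Google Refine’s RDF extension rather than a script like this.

```python
# Hypothetical sketch of spreadsheet-to-RDF instance data generation.
# Column names, namespace and properties are placeholders, not OpenART terms.
import csv
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/openart/")   # placeholder namespace
g = Graph()
g.bind("ex", EX)

with open("ledger-clean.csv", newline="", encoding="utf-8") as src:
    for row in csv.DictReader(src):
        artwork = EX["artwork/" + row["artwork_id"]]
        g.add((artwork, RDF.type, EX.Artwork))
        g.add((artwork, EX.title, Literal(row["title"])))
        if row.get("artist_id"):                # optional link to an artist
            g.add((artwork, EX.byArtist, EX["artist/" + row["artist_id"]]))

g.serialize(destination="openart-instances.ttl", format="turtle")
```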

Technical Implementation

Prototyping and implementing a simple approach to resolving identifiers and content negotiation will come in around 10 days.

Building Understanding

Developing an understanding of the ontology, of linked data principles and of how to construct RDF, I’d estimate at taking up around 25 days, including background reading and research, digging into the ontology, working through examples and selecting the right tools. This also includes time spent on understanding the data itself, working through the spreadsheets, talking with the content creator.

Resources

In terms of people involved, there are a number, including: consultant partners (expert in the semantic aspects and technologies), developer (for implementation and data processing), researcher (for understanding the data), Tate web manager and curator (for exploring the Tate use case and validation), Digital Library core staff (for future sustainability).

Summary

This gives us a grand total of around 80 days, or 16 weeks, around 4 months of work involving a range of people. This is starting out with complex data, gaps in understanding of the data and in how to ‘do’ open data, along with a myriad of potential mechanisms and methods.

What this doesn’t include, though, is the ongoing cost of maintaining and building on the project outcomes, rolling out the generation of data to the full dataset and working with additions and changes to the dataset, as well as extending the work to do more ‘linking’, assessing the scalability of this approach, experimenting with different approaches and promoting the content.

Benefits

Given that it is still early days for ‘linked data’ and for our content, benefits are harder to quantify. The immediate benefits are engagement and seeing our researcher and Tate partners really begin to think about what open and linked data might offer them. Another immediate benefit is a new class of content for the Digital Library and the potential for extending our reach into storing and serving ‘data’ from York Digital Library, enabling us to make progress in opening our data.

Looking forward, what OpenART and open data promise are greater visibility and usability of information, making research richer and easier and reducing the amount of information that needs to be stored locally and constructed manually. For cultural institutions like Tate, open and linked data offer opportunities for increasing web traffic and usage, by offering new ways of exposing and consuming data, greater possibilities for dynamic links between artworks, artists, places and events, and richer visualisations of data. For researchers, and other consumers, open data offers mechanisms to follow different paths through information, tracing art history from creation, through the art market and into galleries, telling interlinked and often unexpected stories along the way. Whilst OpenART deals with only a fragment of the art market, it offers a possible model that, if extended, could open up art history to much more detailed analysis.

How do these balance out?

Personally, I think that the cost is worth it, but only if we start to see real use of the data, and more data being opened up. Given that this latter can’t happen without investment in the former, we need to make a persuasive case to continue the work, engage in the community and build tools to aggregate disparate content for the non-technical user. One lesson learnt though is that not all data is made equal, and ours is still ‘under construction’ which adds a layer of complexity when working with a moving target.


Getting to grips with the OpenART ontology

As part of our OpenART project, our project partners at Acuity Unlimited have been working on an OWL ontology for describing our ‘London Art World’ dataset. This work is nearing completion and will be released in the near future.

For Paul Young (project developer) and me, ontologies are something of a learning curve and we have found that working the ontology into a diagram has really helped. The following is our current version, still in draft and complete with some rough notes, but in the spirit of sharing I thought it would be useful to make it available using our very own York Digital Library. The diagram contains entities (classes) and object properties for the ‘London Art World’ dataset.

OpenART Ontology Diagram (Draft 2)



OpenART and Open Licensing

OpenART Licensing

As part of OpenART we need to decide how to license the data that we will expose about the ‘London Art World’ dataset. We said in the OpenART bid that we would make the data available under the terms of the Open Data Commons PDDL, but this needed to be further explored with the project stakeholders and the creator and contributors of the dataset.

Within the Linked Data community there is a general desire to be as open as possible. The LOCAH project have used the most open license possible, the Creative Commons public domain CC0 license, and have successfully gained the support of their data owners from the Archives Hub and Copac.

JISC’s recent Open Bibliographic Data Guide encourages the use of a free and open licence: “Universities should proceed on the presumption that their bibliographic data will be made freely available for use and reuse.” and “In the vast majority of circumstances, institutions should use a Creative Commons Attribution License (CC-BY) to encourage reuse of copyrightable material. For collections of factual data, the Open Data Commons Public Domain Dedication and License (ODC-PDDL) should be used.” (JISC Open Bibliographic Data Guide, http://obd.jisc.ac.uk/rights-and-licensing). Another useful resource for those considering licenses is the DCC How-To Guide, How to License Research Data (http://www.dcc.ac.uk/resources/how-guides/license-research-data#x1-40003) – this identifies some of the easy-to-miss pitfalls of choosing more restrictive licenses.

For OpenART, the Open Data Commons licenses seem to be the most appropriate for our dataset. I offered the project team two alternative approaches.

The first was the Open Data Commons Open Database License (ODbL). This is the most restrictive of the Open Data Commons Licenses, but does still allow for wide re-use and sharing. This applies to the ‘dataset’, not its contents. This would allow us to then license the content separately, as may be needed in future for the content contributed by others. For the core data (contributed by Richard Stephens) and the initial release of data, the complementary Database Contents License (DbCL) would be appropriate, which simply places the same condition on the content as on the database as a whole.

What ODbL allows for is summarized here: http://opendatacommons.org/licenses/odbl/summary/

In short, this license allows others to re-use freely, so long as they attribute the source, share under the same license and provide unrestricted open access to derived works. It does not explicitly prevent commercial use, but it insists that a public and open version of any derived work is always made available.

The alternative is to use PDDL, the Public Domain license, in conjunction with the ‘Community Norms’. This would place the dataset and its contents in the public domain, with users encouraged to abide by the ‘norms’ of sharing the data in the same way (this has no legal basis, it is a statement of good faith).  This approach is asking our contributors to give up their rights in the data.

What PDDL allows for is summarized here:

http://opendatacommons.org/licenses/pddl/summary/

The ‘Norms’ are summarised here:

http://opendatacommons.org/norms/

The OpenART dataset is the result of several years’ work and has involved considerable intellectual effort. Asking its contributors to cede copyright and attribution is quite a leap. ODbL, therefore, would seem the best compromise, offering wide re-use whilst retaining a link back to those who created the data. The decision to use ODbL is not yet final, but it remains the strongest contender. I will update the post when a final agreement is reached.


JISC YODL-ING Project recent presentations

Our YODL-ING project partners, Steve Bayliss and Martin Dow from Acuity Unlimited, presented some of the excellent work they have been doing for us in the project at Open Repositories back in June. The presentations are available from the conference web site:
  • Stephen Bayliss, Martin Dow, Julie Allinson. Using Semantic Web technologies to integrate thesauri with Fedora to support cataloguing, discovery, re-use and interoperability, Open Repositories 2011, Austin, Texas. PDF
  • Stephen Bayliss, Martin Dow, Julie Allinson. An integrated approach to licensing and access control in Fedora using XACML and the Fedora Content Model Architecture, Open Repositories 2011, Austin, Texas. PDF


Exploring different approaches to getting stuff done

I’m probably not alone in having some budget to spend before July, the University financial year end. Like most universities, we’re facing a lean year in 2011/12, so I have been trying to make the best use of the funds I have available within the timeframe given.

As part of this, I’ve identified some pieces of technical work and put these out in an ITT.  The full ITT is available from:
https://vle.york.ac.uk/bbcswebdav/xid-901764_3

In brief, the three pieces of work are:
1) Implementation of the University of York Archives Hub Spoke and archival stylesheet development

2) Re-usable EAD generation and conversion (from spreadsheets)

3) Implementing page turning and sequencing within a Fedora Commons repository

There are many more things we want to do in the Digital Library but I have chosen these three to help us get a feel for how this approach would fare in the community, whether there are contractors out there looking for small pieces of work like this.

