Archive for ontologies

OpenART Final Report

Outputs produced by OpenART:

  • The OpenART ontology, an event-driven ontology produced to describe the ‘artworld’ dataset. The ontology is split into a number of parts to allow greater re-usability. It should be considered ‘work in progress’, although the version published is complete for the OpenART project: dlib.york.ac.uk/ontologies/
  • An ontology browser version of the ontologies can be found at dlib.york.ac.uk/ontologies/openart – for easier reading!
  • Sample data for each ‘primary’ entity, in the form of an RDF/XML document and a Turtle document: dlib.york.ac.uk/data
  • VoID description for the dataset: dlib.york.ac.uk/data
  • A script for creating all of the documents and ingesting them into the Digital Library Fedora repository.
  • RDFa embedded in the pages of artworld.york.ac.uk (forthcoming)

Next steps for OpenART and the University of York:

For the University of York, there is some work to complete in order to get the full (current) dataset into our Fedora repository, mainly in setting up the URL rewriting and content negotiation rules.

After that, we would ideally like to apply the same linked data principles to other Digital Library content, particularly some of the rich image content that we have. This would involve mapping and modelling work, for example mapping VRA image metadata to linked data, and automating the generation of RDF.

Some thoughts

The approach taken in OpenART was somewhat twofold, with prototyping carried out using a variety of tools (summarised in the technical approaches post) which could be explored further in future work.

The dataset which drives OpenART was released as a web application in October at http://artworld.york.ac.uk, to meet the requirements of the separate AHRC-funded project out of which it was created. The site has been developed as a database-driven application, an approach chosen as the best fit for the time available. The site, always envisaged as an end point for human users, was not initially designed for linked data. Indeed, one might argue that databases and ontologies do not make happy bedfellows. However, what we have found in the project is that it was relatively straightforward to (1) create a script to extract open data documents from the database and ingest them into our Fedora Digital Library, and (2) add RDFa tagging to the web site itself.
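As an indication of how simple the extraction in step (1) can be, here is a minimal sketch in Python. The schema, URIs and template below are invented stand-ins for the real artworld database and the project's script, and the Fedora ingest step is omitted:

```python
import sqlite3

# Hypothetical schema standing in for the artworld database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER, name TEXT)")
conn.execute("INSERT INTO artist VALUES (1, 'Example Artist')")

# Turtle template for one 'primary' entity (placeholder URIs).
TEMPLATE = """@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://artworld.york.ac.uk/artist/{id}>
    a foaf:Person ;
    foaf:name "{name}" .
"""

def extract_documents(conn):
    """Yield a (filename, turtle_text) pair per entity row."""
    for row_id, name in conn.execute("SELECT id, name FROM artist ORDER BY id"):
        yield f"artist-{row_id}.ttl", TEMPLATE.format(id=row_id, name=name)

# In the real pipeline each document would be ingested into Fedora;
# here we simply collect them in memory.
documents = list(extract_documents(conn))
```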

One of the benefits of this approach is that it provides us with a non-proprietary back-up and preservation routine for the database, playing to one of Fedora’s strengths. It also demonstrates how Fedora can be used in place of a simple file structure to serve up linked data documents, bringing with it the advantages of data management, indexing and version control.

What this rather document-centric approach does not provide is a fully indexed RDF store with a SPARQL end point. Although Fedora has these as part of its stack, they are internal tools for Fedora, not designed for indexing anything other than the core Fedora datastreams. Future work to enable Fedora’s ‘semantic’ capacity for external content would be extremely useful. The European Interactive Knowledge Stack (IKS) project is doing interesting work in this area (http://www.iks-project.eu/).

Opportunities

OpenART was always focussed on a narrow rich seam of data, rather than a broad simpler dataset. There is an opportunity here to see how these two approaches can co-exist. Good ontology modelling will allow rich drilled-down terms to be mapped back to broader concepts for greater findability of content, whilst allowing much finer-grained analysis of the detail captured by the ontology. Where there may be a gap is in the tools which query, visualise and analyse the data sources.

Extending existing applications to better support open data is another opportunity, allowing standard repository platforms such as EPrints, DSpace and Fedora Commons to offer standard linked data endpoints, with options for configuring the data exposed.

Google Refine has come out strongly in OpenART as an extremely useful tool for manipulating datasets. It is particularly well suited to people who do not have in-depth programming skills but want to get RDF out of semi-structured documents. There is an opportunity to dispel some of the mystique around creating open data and RDF, which can be quite simple to do.
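To make that point concrete, here is a toy ‘triplification’ of a small table using nothing but the Python standard library; the column names and namespace are invented for the example:

```python
import csv, io

# A tiny semi-structured table, as might be exported from a spreadsheet.
table = "id,name\n1,Example Artist\n2,Another Artist\n"

BASE = "http://example.org/artist/"            # placeholder namespace
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"   # real FOAF property URI

# One N-Triples statement per row: <subject> <predicate> "object" .
triples = [
    f'<{BASE}{row["id"]}> <{FOAF_NAME}> "{row["name"]}" .'
    for row in csv.DictReader(io.StringIO(table))
]
```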

Evidence of reuse

Our data has not been re-used, although we have had interesting discussions with a range of stakeholders from Tate, to be summarised in a blog post on how others could follow in OpenART’s footsteps.

Skills

What skills were used in your project? Did you already have these skills in your team or did you need to develop them or bring in external experts? Are the processes you have developed embedded in your institutional practice now or are there plans to embed them? Do you plan to develop these skills further?

OpenART did have a range of skills in the team, covering Java programming, databases, Fedora Commons and metadata. External experts brought ontology modelling and RDF expertise, along with additional Fedora Commons expertise; these were essential for the project. Members of the project team have gained a much deeper knowledge of RDF and ontologies, which we hope to embed at York through further projects around linked open data.

Lessons

Lesson 1

Ontology modelling is complex, so allow plenty of time for it. Take time to consider the data model and the best approach: a simpler ‘mix and match’ of existing schema terms might be suitable. For OpenART, where the data is very specific, an ontology was considered the best approach.

Lesson 2

Use Turtle during development phases and get familiar with validation and inspection tools. Turtle is a simple notation for RDF and is very easy to write and to understand. It can be exported directly out of Google Refine and validated by common tools (e.g. http://www.rdfabout.com/). Any23 (http://any23.org/) can be used to generate other formats, such as RDF/XML or RDFa. Sindice’s Inspector (http://inspector.sindice.com/) is useful for viewing the relationships in RDF and checking that documents are not just valid, but also correct. Google Refine can be used as a relatively rapid application for generating RDF samples.
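For illustration, a minimal Turtle document in the general style of the project's sample data might look like this (the URI and names are invented, not taken from the published ontology):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<http://artworld.york.ac.uk/artist/example>
    a foaf:Person ;
    foaf:name "Example Artist" ;
    dcterms:description "A painter active in London." .
```

The prefix declarations at the top and the semicolon-separated predicate lists are what make Turtle so much easier to read and hand-edit than the equivalent RDF/XML.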

Lesson 3

Come up with use cases. What questions do users want to answer with your data? What links do they want to follow? Understanding the uses and potential uses of the data can help with modelling, but also with making the case for doing linked data in the first place.
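For example, a use case such as “which artworks were sold at a sale, and who bought them?” might translate into a SPARQL query along these lines (the prefix and property names here are placeholders in the general style of the data, not the published ontology's terms):

```sparql
PREFIX oa: <http://example.org/openart#>

SELECT ?artwork ?buyer
WHERE {
  ?sale a oa:Sale ;
        oa:itemSold ?artwork ;
        oa:boughtBy ?buyer .
}
```

Writing a handful of queries like this early on is a quick way to check that the model actually supports the questions users will ask of it.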


OpenART Technology Choices

During the course of the OpenART project we’ve come across a number of different technologies and standards that we could use. In this post we aim to cover the ones we’ve used and found useful.

We aim to answer the following series of questions:

What technologies, frameworks, standards are you using in your project? What are your impressions of them? What difficulties have you encountered? What approaches and techniques have worked well? What advice would you give to others engaging with them?

“What technologies, frameworks, standards are you using in your project?”

Data preparation and mapping/triplification/lifting

The source information was supplied as a set of Excel spreadsheets containing semi-structured data. These spreadsheets were provided by the researcher and were the “live” data capture mechanism for the transcribed source data. As such they were subject to structural changes during the lifetime of the project. Various tools and approaches for manipulation of this source data were explored.

  • Relational Database (RDBMS)
    • Early in the project an approach of migrating the data to an RDBMS was explored
    • This resulted in a cleaner version of the source information, but the approach was abandoned as the spreadsheets were a live tool
    • The RDBMS ontology mapping plugin for Protege was explored for transfer of RDBMS data to RDF/OWL
  • Excel Spreadsheet manipulation
    • Google Refine was used for very quickly visualising and manipulating source information
    • RDF Extension for Google Refine was used for exporting cleaned-up data held in Refine to test versions of RDF linked data, and for matching nodes to validate and connect parts of the data (via the reconciliation feature)
    • GNU SED (Stream Editor) is a powerful general purpose text manipulation tool. This hits the cleaning spots the developers can’t reach (quickly) in the visual Refine environment.

Ontology development

  • Protege version 4.1 was used for ontology development. Protege is a powerful and fully featured ontology IDE (integrated development environment)

Linked Data manipulation and production

  • Rapper (part of the Raptor RDF Parser toolkit) was used for translating between various RDF serialisations. Rapper is a parsing and serialising utility for RDF.
  • SPARQL (a standard query language for RDF) for querying and checking data
    • Rasqal via the online service at Triplr. Rasqal provides RDF querying capabilities, including SPARQL queries
    • ARQ, a SPARQL query engine for Jena

Hosting and base systems

  • openSUSE Linux, a major Linux distribution, was used as a base operating system
  • SuseStudio was used to build and deploy bespoke virtual machines to Amazon EC2. SuseStudio allows you to choose packages to build the virtual machine and will directly deploy to an Amazon web service account all within a web-based interface
  • Amazon Web Services (AWS) was used for hosting, particularly the Elastic Compute Cloud (EC2) service for virtualisation
  • The Apache web server with mod_rewrite was used for web hosting of ontologies and data, with content negotiation
  • OntologyBrowser was used for storing, viewing, manipulating and accessing the ontology, using a web front end and REST API
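As a sketch of the content negotiation mentioned above, mod_rewrite rules of roughly this shape can route requests for an ontology URI to different serialisations based on the Accept header. The paths and rule details are illustrative only, not the actual dlib.york.ac.uk configuration:

```apache
RewriteEngine On

# Clients asking for Turtle get the .ttl serialisation
RewriteCond %{HTTP_ACCEPT} text/turtle
RewriteRule ^/ontologies/openart$ /ontologies/openart.ttl [L]

# Clients asking for RDF/XML get the .rdf serialisation
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^/ontologies/openart$ /ontologies/openart.rdf [L]

# Everyone else (browsers) gets the human-readable HTML view
RewriteRule ^/ontologies/openart$ /ontologies/openart.html [L]
```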

“What difficulties have you encountered?”

Source data manipulation and conversion

  • The source information was being actively worked upon during the lifetime of the project. This presented difficulties: working on the ontology and the triplification to RDF concurrently with changes in the source information made decision-making and coordination difficult
  • The source information was provided as Excel spreadsheets. These were used as an organisational environment for capturing data by the researcher, and provided free text search, categorisation and a basic naming scheme for entities. Although a degree of rigour was applied, the environment could not be described as “structured data” such as that provided by a database, and this presented challenges in interpreting and modelling the information.
  • Originally the project assumed that hosting of the information as part of the Court, Country, City project would provide a structured source for the information; however, the requirements of that project diverged sufficiently from OpenART’s for this not to be the case. As a result, the project did not consider alternative automated processes to assist in source data manipulation and triplification until late in the project.

Ontology development tooling

  • Making bulk changes to the ontology in Protege was difficult to do efficiently
  • Synchronisation between environments was a challenge with new versions of ontologies

Ontology development informational issues

  • Difficulties in extending the combined ontologies – particularly when trying to move from earlier versions expressing “simplifications” to a more complex framework later in the project. It would have been easier to start with a more complex framework initially rather than trying to extend the earlier version.
  • Naming conventions became very cumbersome, for example when trying to create inverse properties that sometimes had no natural or simple English-language fit
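To illustrate the naming problem, declaring an inverse property in Turtle quickly runs out of natural English (all names below are placeholders, not terms from the published ontology):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix oa:  <http://example.org/openart#> .

oa:soldAt a owl:ObjectProperty .          # artwork → sale event: reads naturally

oa:wasVenueOfSaleOf a owl:ObjectProperty ;  # sale event → artwork: already awkward
    owl:inverseOf oa:soldAt .
```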

“What are your impressions of them?”

  • Protege and the RDF tools used locally were found to be very mature. Protege is now mature enough to be usable by newcomers, and also provides a route to further development as it is built on the OWL API. However, it was difficult to reason over anything more than a small set of data points.
  • Google Refine was surprisingly efficient and workable. Additional functionality, such as the reuse of RDF mapping templates, was useful.
  • Cloud services (Amazon EC2) were relatively easy to use; and the combination of openSUSE and SuseStudio provided an efficient and repeatable mechanism for deploying virtual machines
  • OntologyBrowser presents a nice interface for ontology browsing in HTML. Its user-friendly syntax for ontology axioms means that you don’t need to be an OWL or logic expert. However, it is very “data-centric” and not an end-user environment.

“What approaches and techniques have worked well?”

  • Google Refine is good for rapid development when a non-trivial ontology means that the mappings and the ontology need to be developed in tandem
  • Use of Protege (see above)
  • Use of cloud services (see above)

“What advice would you give to others engaging with them?”

  • Investigate using a collaborative ontology environment, such as Knoodl
  • Invest a small amount of time in collaborative environments; this can lead to a big payoff. Cloud services are relatively easy to use
  • Start with an ontology framework complex-enough to represent the modelling required; it can be difficult moving from a simple to a more complex version
Author: Martin Dow


Getting to grips with the OpenART ontology

As part of our OpenART project, our project partners at Acuity Unlimited have been working on an OWL ontology for describing our ‘London Art World’ dataset. This work is nearing completion and will be released in the near future.

For Paul Young (project developer) and me, ontologies are something of a learning curve, and we have found that working the ontology into a diagram has really helped. The following is our current version, still in draft and complete with some rough notes, but in the spirit of sharing I thought it would be useful to make it available using our very own York Digital Library. The diagram contains the entities (classes) and object properties for the ‘London Art World’ dataset.

OpenART Ontology Diagram (Draft 2)
