During the course of the OpenART project we came across a number of technologies and standards we could use. In this post we cover the ones we used and found useful.
We aim to answer the following series of questions:
What technologies, frameworks, standards are you using in your project? What are your impressions of them? What difficulties have you encountered? What approaches and techniques have worked well? What advice would you give to others engaging with them?
The source information was supplied as a set of Excel spreadsheets containing semi-structured data. These spreadsheets were provided by the researcher and were the “live” data capture mechanism for the transcribed source data. As such they were subject to structural changes during the lifetime of the project. Various tools and approaches for manipulation of this source data were explored.
- Relational database (RDBMS)
  - Early in the project an approach of migrating the data to an RDBMS was explored.
  - This resulted in a cleaner version of the source information, but the approach was abandoned as the spreadsheets were a live tool.
  - The RDBMS ontology mapping plugin for Protege was explored for transferring RDBMS data to RDF/OWL.
- Excel spreadsheet manipulation
  - Google Refine was used for very quickly visualising and manipulating the source information.
  - The RDF Extension for Google Refine was used for exporting cleaned-up data held in Refine to test versions of RDF linked data, and for matching nodes to validate and connect parts of the data (via the reconciliation feature).
  - GNU SED (Stream Editor), a powerful general-purpose text manipulation tool, hits the cleaning spots the developers can't reach (quickly) in the visual Refine environment.
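As an illustration, a single sed pass can handle the kind of clean-up that is slow to do cell-by-cell in Refine. The sample record and the specific clean-ups below are hypothetical, not taken from the project data:

```shell
# Hypothetical clean-up of a record exported from the source spreadsheets:
# strip trailing whitespace, collapse runs of spaces, and normalise
# "smart" quotes to plain ASCII quotes.
echo 'Smith,  “View of   Delft” ,1660  ' \
  | sed -e 's/[[:space:]]*$//' \
        -e 's/  */ /g' \
        -e 's/“/"/g' \
        -e 's/”/"/g'
# → Smith, "View of Delft" ,1660
```

The same expressions can be applied across a whole exported CSV with `sed -i`, which is where sed's speed advantage over interactive editing shows.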
- Protege version 4.1, a powerful and fully featured ontology IDE (integrated development environment), was used for ontology development.
- Rapper (part of the Raptor RDF Parser toolkit) was used for translating between various RDF serialisations. Rapper is a parsing and serialising utility for RDF.
- SPARQL (a standard query language for RDF) was used for querying and checking data.
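For a flavour of the kind of checking query involved, a minimal SPARQL SELECT might look like the following; the namespace and property names are illustrative, not the project's actual ontology:

```sparql
PREFIX oa: <http://example.org/openart/>

SELECT ?artwork ?dealer
WHERE {
  ?artwork a oa:Artwork ;
           oa:soldBy ?dealer .
}
LIMIT 10
```

Queries of this shape are useful for spot-checking that triplified data actually carries the relationships the ontology promises.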
- openSUSE Linux, a major Linux distribution, was used as a base operating system
- SuseStudio was used to build and deploy bespoke virtual machines to Amazon EC2. SuseStudio lets you choose the packages for a virtual machine and deploy it directly to an Amazon Web Services account, all within a web-based interface.
- Amazon Web Services (AWS) was used for hosting, particularly the Elastic Compute Cloud (EC2) service for virtualisation
- The Apache web server with mod_rewrite was used for web hosting of ontologies and data, with content negotiation
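A typical mod_rewrite recipe for this style of content negotiation follows the pattern of the W3C "Best Practice Recipes" for publishing RDF vocabularies; the file names and paths here are illustrative rather than the project's actual configuration:

```apache
# Redirect clients that ask for RDF/XML to the RDF serialisation,
# and everyone else to the HTML documentation.
RewriteEngine On

RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^ontology$ /ontology.rdf [R=303,L]

RewriteRule ^ontology$ /ontology.html [R=303,L]
```

The 303 ("See Other") redirect is the conventional status code for directing a request for a non-information resource to a document about it.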
- OntologyBrowser was used for storing, viewing, manipulating and accessing the ontology, using a web front end and REST API.
- The source information was being actively worked upon during the lifetime of the project. Developing the ontology and the triplification to RDF concurrently with changes in the source information made decision-making and coordination difficult.
- The source information was provided as Excel spreadsheets. These were used as an organisational environment for capturing data by the researcher, and provided free text search, categorisation and a basic naming scheme for entities. Although a degree of rigour was applied, the environment could not be described as “structured data” such as that provided by a database, and this presented challenges in interpreting and modelling the information.
- Originally the project assumed that hosting of the information as part of the Court, Country, City project would provide a structured source for the information; however the requirements from this project were sufficiently divergent from the OpenART project’s requirements for this not to be the case. As a result the project did not consider alternative automated processes to assist in source data manipulation and triplification until late in the project.
- Making bulk changes to the ontology in Protege was difficult to do efficiently.
- Synchronisation between environments was a challenge with new versions of ontologies
- Difficulties in extending the combined ontologies – particularly when trying to move from earlier versions expressing “simplifications” to a more complex framework later in the project. It would have been easier to start with a more complex framework initially rather than trying to extend the earlier version.
- Naming conventions became very cumbersome, for example when trying to create inverse properties that sometimes had no natural or simple English language fit
- Protege and the RDF tools used locally were found to be very mature. Protege is now mature enough to be usable by newcomers, and also provides a route to further development as it is built on the OWL API. However, it was difficult to reason over anything more than a small set of data points.
- Google Refine was surprisingly efficient and workable. Additional functionality such as the reuse of RDF mapping templates was useful.
- Cloud services (Amazon EC2) were relatively easy to use; and the combination of openSUSE and SuseStudio provided an efficient and repeatable mechanism for deploying virtual machines
- OntologyBrowser presents a nice interface for ontology browsing in HTML. Its user-friendly syntax for ontology axioms means that you don’t need to be an OWL or logic expert. However it is very “data-centric” and not an end-user environment.
- Google Refine is good for rapid development when a non-trivial ontology requires that mappings and ontology need to be developed in tandem
- Use of Protege (see above)
- Use of cloud services (see above)
- Investigate using a collaborative ontology environment, such as knoodl
- Investing a small amount of time in collaborative environments can lead to a big payoff; cloud services are relatively easy to use.
- Start with an ontology framework complex-enough to represent the modelling required; it can be difficult moving from a simple to a more complex version