Yet Another Rails App in Subdirectory Howto

Last week I tried, unsuccessfully, to help someone on Slack with deploying a Rails application in a subdirectory. By which I mean something like http://example.com/subdir, instead of at the root as is more usually the case. As it turns out, I then needed to do this myself this week and had just as many problems. At least some of the difficulty comes from there being a few different ways to achieve this, as a Google search will quickly show. So what the internet definitely doesn’t need is yet another howto. So here is one. To try and prevent at least some confusion: this is for a Rails 5 application deployed using Apache and Passenger.

After a few attempts with various environment variables the simplest working solution was to use PassengerBaseURI in the Apache site configuration.

<VirtualHost *:80>
   ServerName example.com
   DocumentRoot /var/www/myapp

   <Directory /var/www>
      Options -MultiViews
      PassengerBaseURI /subdir
      PassengerAppRoot /websites/myapp
   </Directory>

</VirtualHost>

Here the document root, /var/www/myapp, is a symbolic link to the PassengerAppRoot (/websites/myapp).
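
The symlink itself can be created however you prefer; as a minimal sketch in Ruby, using the same paths as the VirtualHost above:

require 'fileutils'

# Point the Apache document root at the Passenger application root.
FileUtils.ln_s('/websites/myapp', '/var/www/myapp')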

The benefit of this approach is that very few changes seem to be needed in the application itself. The only real modification concerns the assets. As the application will be running in the production environment, the assets will be served by Apache, not by the Rails application. Apache will look for the assets at public/subdir/assets, so we need to tell Rails to generate them there rather than in the default location. This can be done with the config.assets.prefix configuration setting. Add this to config/environments/production.rb:

config.assets.prefix = "/subdir/assets"

Now when the assets are precompiled they will be created in the correct location:

RAILS_ENV=production bundle exec rake assets:precompile

That should be all that is needed to access the application at the subdirectory.

NER-ding it up

At the end of 2018 one of our members completed ingesting a large collection of photographs taken in various locations around Ireland from the 1950s onwards. It would be great for users to be able to browse this collection on the map view. Unfortunately, the approach I described in Mapping with Added SPARQL can’t be applied, as the metadata does not contain geographic information in the required field, i.e., dc:spatial. There is, however, quite a lot of location information in the metadata’s title and description fields, e.g., ‘View at Lough Leane [Lower Lake], Killarney, County Kerry’. But how to make use of this?

What is needed is a method for extracting the locations from this text. Given that the collection is quite large, an automated approach would be preferable. Happily there is a technique for doing exactly this: Named-entity Recognition (NER). NER attempts to locate and classify named entities (e.g. locations, people, organisations) in unstructured text. There are various services available that will do this for you, but as I have been using OpenRefine in the previous linked data workflows, it made sense to integrate this here as well. Ruben Verborgh created a NER extension for OpenRefine as part of the Free Your Metadata collaboration. The extension had not been updated in some time, and so was not working with the latest OpenRefine 3.1 release. I’ve created an updated fork of the extension that you can find on GitHub. The extension allows you to configure and use multiple online services to perform NER on a column of values in an OpenRefine project. These services include DBpedia Spotlight, the Dandelion API and others.

Once installed, using the extension is straightforward. Click the down arrow in the title of the column you want NER performed on and select ‘Extract named entities…’.

NER menu option

This opens a dialog where you can select the services you want to use, and set any extraction parameters available.

NER services dialog

The first service that I tried, as it is freely available, was the DBpedia Spotlight service. This performed well, and as an added bonus it also returns a DBpedia URL with each value. As you can see in the example below, it has managed to extract ‘Valentia’ and ‘County Kerry’ from the value ‘Valentia Harbour, Valentia, County Kerry’.

DBpedia Spotlight NER
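
The Spotlight service itself is just an HTTP API, so the kind of call the extension makes can be sketched in Ruby. This sketch assumes the public endpoint at api.dbpedia-spotlight.org (the extension may be configured against a different Spotlight instance), and the confidence parameter is illustrative.

require 'net/http'
require 'json'
require 'uri'

# Annotate a single cell value against the DBpedia Spotlight annotate endpoint.
def spotlight_entities(text, confidence: 0.5)
  uri = URI('https://api.dbpedia-spotlight.org/en/annotate')
  uri.query = URI.encode_www_form(text: text, confidence: confidence)

  request = Net::HTTP::Get.new(uri)
  request['Accept'] = 'application/json'

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end

  # Each resource has a surface form (the text matched) and a DBpedia URI.
  JSON.parse(response.body).fetch('Resources', []).map do |resource|
    { surface_form: resource['@surfaceForm'], uri: resource['@URI'] }
  end
end

spotlight_entities('Valentia Harbour, Valentia, County Kerry')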

Many of the other services require signing up for an API key, and have various limits on the number of calls that you can make. Although this would probably be fine for most use cases, I wanted to try to configure a service that I would have more control over. Stanford NLP provides a set of human language technology tools, including NER. It is possible to download the software and run it locally as a service. This was easy to do following the instructions in the CoreNLP documentation. To use this with the NER extension meant writing a new service class. The NER extension has been written in such a way as to make it extensible. The full code for the new service class, StanfordNLP.java, is available to view on GitHub. Running NER on the column against this service performed similarly to the DBpedia Spotlight service, although in some cases it gave arguably better entities. For example, it returns ‘Kerry’ rather than ‘County Kerry’. It also managed to extract ‘Valentia Harbour’, not just ‘Valentia’.

Stanford NLP

In some cases it also seemed to find more entities, such as ‘Portmagee’. Obviously here we are missing the DBpedia URL, but as the next stage will involve reconciliation using the RDF extension, that might not be an issue.

Stanford NLP
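
For reference, the call the new service class makes to the locally running server can be sketched in Ruby (the extension itself is written in Java). This assumes a CoreNLP server listening on localhost:9000 with default settings.

require 'net/http'
require 'json'
require 'uri'

# Ask a local Stanford CoreNLP server for token-level NER tags.
def corenlp_entities(text)
  properties = { annotators: 'tokenize,ssplit,ner', outputFormat: 'json' }.to_json
  uri = URI("http://localhost:9000/?properties=#{URI.encode_www_form_component(properties)}")

  response = Net::HTTP.post(uri, text)
  tokens = JSON.parse(response.body)['sentences'].flat_map { |sentence| sentence['tokens'] }

  # Tokens tagged 'O' are not entities; the rest carry a type such as LOCATION.
  tokens.reject { |token| token['ner'] == 'O' }
        .map { |token| { word: token['word'], type: token['ner'] } }
end

corenlp_entities('Valentia Harbour, Valentia, County Kerry')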

Linked Data Step 3

Linked Data Step 1 was adding RDF, generated from DRI XML metadata, as an output format of the DRI repository application. Step 2 is using OpenRefine to reconcile the RDF metadata fields with other datasets; namely, DBpedia, LCSH (for subjects) and Linked Logainm (for places).

The last step is to import this enhanced RDF, exported from OpenRefine, together with the RDF output generated by the DRI application, into a triple store to allow for querying. The full workflow, then, comprising all three steps, is shown in the figure below.

Workflow

The RDF generated by the application from the object’s XML metadata follows the guidelines produced by the BBC’s Research and Education Space (RES) Project. To automate the process of importing the RDF into the triple store, more components from the RES project can be used. Anansi is a web crawler that ‘includes specific support for Linked Data’. Objects from the repository that we wish to import can be selected by adding their URLs to the Anansi crawl queue. The second component, Twine, reads resources that have been successfully crawled by Anansi, processes them and imports the resulting RDF into the triple store.

Anansi Twine processing

Now on to the fun of SPARQL queries!
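
As a small taste of what that querying might look like, here is a sketch using the sparql-client gem to list the places attached to objects. The endpoint URL and the dc prefix mapping below are assumptions, so adjust both to match the actual triple store.

require 'sparql/client'

# Endpoint URL is illustrative; substitute the real SPARQL endpoint.
sparql = SPARQL::Client.new('http://data.dri.ie/sparql')

query = <<~SPARQL
  PREFIX dc: <http://purl.org/dc/terms/>
  SELECT DISTINCT ?object ?place WHERE {
    ?object dc:spatial ?place .
  }
SPARQL

sparql.query(query).each do |solution|
  puts "#{solution[:object]} -> #{solution[:place]}"
end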

Linked Data Step 2

In Linked Data Step 1 I ended by saying that the next step was to perform reconciliation on selected object metadata terms and to add the resulting links to the object’s RDF output. In fact, looking at the RDF snippets given as examples, you can see that this was already done. The fields I reconciled were dc:creator, dc:subject and dc:spatial. To do this those terms were exported from all the objects contained in one collection and then imported as a project in OpenRefine. Using OpenRefine together with the RDF extension I reconciled these against DBpedia, LCSH (for subjects) and Linked Logainm (for places). The RDF extension allows you to build a skeleton (a template for the RDF output) for exporting to either RDF/XML or TTL. The exported RDF can then be imported to a SPARQL endpoint.

Here, for example, is the generated RDF output for a reconciled subject term, based on the skeleton that I defined:

<rdf:Description rdf:about="https://repository.dri.ie/resource/easter_rising_1916">
        <skos:exactMatch rdf:resource="http://id.loc.gov/authorities/sh85068025#concept"/>
        <skos:prefLabel xml:lang="en">Easter Rising 1916</skos:prefLabel>
        <rdfs:seeAlso rdf:resource="http://dbpedia.org/resource/Easter_Rising"/>
</rdf:Description>

You can see that the subject term value Easter Rising 1916 has been linked to an LCSH subject and a DBpedia entry. Now when you view the RDF format output of an object containing this value, you see the link to the RDF resource instead of the literal:

<https://repository.dri.ie/catalog/3b5918966#id>
dc:subject 
     <https://repository.dri.ie/resource/easter_rising_1916>,

When generating an object’s RDF output, the RDF formatter code retrieves the RDF subject URIs by issuing a SPARQL query to the SPARQL endpoint:

def sparql_subject(value)
  Rails.cache.fetch(value, expires_in: 48.hours) do
    provider = DRI::Sparql::Provider::Sparql.new
    provider.endpoint = AuthoritiesConfig['data.dri.ie']['endpoint']

    triples = provider.retrieve_data([nil, 'skos:prefLabel', "\"#{value}\"@en"])

    triples.present? ? triples.first[0] : nil
  end
end

This method takes the literal value, e.g. ‘Easter Rising 1916’, and queries for the subject of a triple that has this value as a preferred label. If one is found, this is used in place of the literal value.
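
For example, a call like the following (hypothetical, but assuming the reconciled triples shown above have been imported to the endpoint) would return the resource URI in place of the literal:

sparql_subject('Easter Rising 1916')
# => "https://repository.dri.ie/resource/easter_rising_1916"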

The final step, then, is to take the RDF output for each object in a collection, including the new reconciled URIs, and import it into the SPARQL endpoint.

App to Engine

Some time ago I put together an application for batch ingesting data into the DRI repository. This app was to replace the existing command-line tool. Although I did use the app a few times, it never really became part of the production infrastructure. At least, not in a way that would make it available to end users. I decided that the easiest way forward was to integrate the functionality into the repository, but without having to add all the code directly into the codebase. To do this I converted the app into a Rails engine.

First step, create the engine skeleton:

rails plugin new dri_batch_ingest -T --mountable --full

This creates the basic directory structure and files needed by the engine. Once this exists, the next task is to copy the source files from the app into the correct locations in the engine. Because this is to be a namespaced engine, the main directories are:

app/controllers/dri_batch_ingest
app/models/dri_batch_ingest
lib/dri_batch_ingest

The original application had a number of dependencies for the frontend. The correct place to put these in an engine is the gemspec file, dri_batch_ingest.gemspec.

s.add_dependency "rails", "~> 4.2.10"
s.add_dependency 'kaminari'
s.add_dependency 'fuelux-rails-sass'
s.add_dependency 'underscore-rails'
s.add_dependency 'iconv'
s.add_dependency 'filesize'

Two extra dependencies that could not be added to the gemspec are avalon_ingest and browse-everything. The first of these bundles together some functionality for performing the batch ingest; the other allows for browsing various cloud-storage endpoints. I’m pulling these dependencies from GitHub, which is not allowed within the gemspec, so they had to be added to the Gemfile instead.

gem 'avalon_ingest', git: 'https://github.com/stkenny/avalon_ingest'
gem 'browse-everything', git: 'https://github.com/stkenny/browse-everything.git', branch: 'feature/per_user'

To load these dependencies we need to edit the engine file, lib/dri_batch_ingest/engine.rb:

# dependencies
require 'underscore-rails'
require 'fuelux-rails-sass'
require 'browse_everything'

module DriBatchIngest
  class Engine < ::Rails::Engine
    isolate_namespace DriBatchIngest

    # use rspec for testing
    config.generators do |g|
      g.test_framework :rspec
    end

    # this allows the migrations to stay in the engine
    # rather than having to copy them into the app the 
    # engine will be used in
    initializer :append_migrations do |app|
      unless app.root.to_s.match root.to_s
        config.paths["db/migrate"].expanded.each do |expanded_path|
          app.config.paths["db/migrate"] << expanded_path
        end
      end
    end

  end
end

and then require the JavaScript in the asset pipeline, in app/assets/javascripts/dri_batch_ingest/index.js:

//= require underscore
//= require fuelux
//= require_tree .
//= require browse_everything

The namespacing of the engine means that the tables created by migrations also look a bit different from those in a normal application: they have a prefix, in this case dri_batch_ingest_. Here is an example:

class CreateTableIngestBatch < ActiveRecord::Migration

  def change
    create_table :dri_batch_ingest_ingest_batches do |t|
      t.string :email
      t.string :collection_id
      t.integer :user_ingest_id
      t.text :media_object_ids
      t.boolean :finished, :default => false
      t.boolean :email_sent, :default => false
      t.timestamps
    end
  end
end

This engine now encapsulates all the batch ingest functionality needed by the main DRI repository application. You can see the full engine on the DRI GitHub.
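
To actually use the engine, the host application adds it as a dependency and mounts its routes. A minimal sketch, with an illustrative gem source and mount path:

# Gemfile of the host application (the source shown is just an example)
gem 'dri_batch_ingest', path: '../dri_batch_ingest'

# config/routes.rb of the host application
Rails.application.routes.draw do
  mount DriBatchIngest::Engine, at: '/batch_ingest'
end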