
System Architecture

Christian Forney, Peter Dängeli

General Overview

hallerNet is an XML-based publication environment for historical research data. Its data consists of structural data and textual data (transcriptions, historical accounts) encoded in accordance with the TEI guidelines. Building on a common file system, the hallerNet environment is well suited to different editing workflows. Schema-based data validation and version control allow a high degree of control over the research data at any stage.

The hallerNet environment combines two interacting services:

  • The presentation of the data on the web platform is handled by a frontend application implemented with a common JavaScript framework.

  • A platform-specific editing environment, in the form of a framework and plugin for Oxygen XML Editor, facilitates data authoring. The XML code is rendered using CSS rules to offer a user-friendly editing interface. The environment integrates data lookup queries for entity linking and makes use of specific backend endpoints, for instance to retrieve the next available ID or to update the index.

All data and software of hallerNet are released under free licenses.

Governance and contact

The hallerNet platform is operated by the Historical Institute of the University of Bern. Please refer to https://hallernet.org/about for more information.

Contact: contact@hallernet.org

About this Documentation

The goal of this documentation is to provide an overview of the data model and the software that constitute the technical base of the hallerNet platform.

Core Principles

Architectural decisions were influenced by the following aspects:

  • research data as a first-class citizen
  • modularity
  • conformance to standards (technical standards and community standards)
  • flexible use of XML tooling and languages

Data

Research Data as a First-Class Citizen

A strong motivation for the chosen architecture emerged from the history of Haller research in Bern and its use and re-use of born-digital research data over many years. The value of the data and its description quickly became clear when porting the data from a (relatively) closed database to a newly created open environment. In order to keep the data at least equally accessible and extensible in the future, it seemed sensible to avoid tool and system dependencies on the level of the data as much as possible. Consequently, research data was put at the centre of the infrastructure, and the tools and workflows were built around it.

In more operational terms, all data is stored in plain text formats on a conventional file system[1] and subject to revision control (Git). In order to restrict version control to significant changes and to ignore insignificant ones (e.g. whitespace-only differences), server-side file storage and transformation runs make use of a unified file serialisation.
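
As a minimal sketch, such a unified serialisation can be pinned down in a shared set of XSLT output settings (the concrete option values below are assumptions, not the settings actually used by hallerNet):

xml
  <!-- hypothetical shared serialisation settings; the concrete values are assumptions -->
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>
  <!-- writing every file with the same settings keeps Git diffs free of whitespace-only noise -->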

Data model / TEI

The data model of the platform consists of two major parts:

  • structural data, i.e. metadata on the entities
  • the extensible hallerNet edition model detailing the transcription methods

These parts are described in more detail in the data model documentation and the edition model documentation.

Formally, the model is specified in a number of Schematron schemas and a more global ODD description.
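A hedged illustration of the Schematron part (the context element and the constraint are invented for this example, not taken from the actual schemas):

xml
  <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    <!-- hypothetical rule: every person entry must carry an xml:id -->
    <sch:pattern>
      <sch:rule context="tei:person">
        <sch:assert test="@xml:id">Every person entry needs an xml:id.</sch:assert>
      </sch:rule>
    </sch:pattern>
  </sch:schema>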

FAIR data repository

The research data is periodically published on the Zenodo FAIR data repository in its entirety under a free license (CC BY 4.0).

Backend Application

The hallerNet backend application serves the platform data in formats consumable by the frontend application and external services (such as CorrespSearch).

Conceptually, a series of data conversions is at play, orchestrated based on the structure of the URL request. For its suitability for processing XML files (in a templating fashion based on pattern matching) as well as its versatility, XSLT 3.0 was chosen as the predominant application programming language.

The implementation of the backend application is based on the open-source web application framework XSLWeb, an accessible and pragmatic successor to pipeline-based frameworks such as Apache Cocoon. It is a server-based application (written in Java) that acts as an orchestrating entity, parsing client requests and invoking conversion pipelines.

General approach and structure

A modularised request-dispatcher stylesheet acts as the backbone of the conversion pipelines. Depending on the structure of the client request, data is selected, combined, enriched, and finally transformed to an output format (XML, JSON, Markdown) or posted to other services (such as the search index).[2]

Illustration of the principle

A simple example may illustrate how the application works:

The request /api/data/platform/about/xml issued by the frontend application is first processed by the main request dispatcher file (request-dispatcher.xsl), which contains just a handful of template rules, none of which matches the pattern of the request. Besides the locally defined rules, six modules are included:

xml
  <xsl:include href="request-dispatchers/req-disp-apis.xsl"/>
  <xsl:include href="request-dispatchers/req-disp-data.xsl"/>
  <xsl:include href="request-dispatchers/req-disp-file.xsl"/>
  <xsl:include href="request-dispatchers/req-disp-index.xsl"/>
  <xsl:include href="request-dispatchers/req-disp-lod.xsl"/>
  <xsl:include href="request-dispatchers/req-disp-sitemap.xsl"/>

The request is then evaluated against the templates in these modules, and a match is found in the data module:

xml
  <xsl:template name="api-data-platform-about-xml" match="/req:request[matches(req:path, '^/api/data/platform/about/xml/?$')]">
    <pipeline:pipeline>
      <pipeline:transformer name="pipeline-step1" xsl-path="lib/api/data/about-xml.xsl" log="true"/>
    </pipeline:pipeline>
  </xsl:template>

As the structure of the request (req:request) matches the given regular expression, a pipeline is invoked: the transformation process defined in about-xml.xsl is launched. As a general rule, the path where a stylesheet is located (here lib/api/data/) reflects the module name.
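
A hedged sketch of what such a pipeline stylesheet could look like (the document path is an assumption; the actual about-xml.xsl will differ):

xml
  <!-- hypothetical pipeline step: load the platform 'about' file and return it as the response body -->
  <xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:req="http://www.armatiek.com/xslweb/request">
    <xsl:template match="/req:request">
      <!-- the relative path to the data file is an assumption -->
      <xsl:sequence select="doc('../../../data/platform/about.xml')"/>
    </xsl:template>
  </xsl:stylesheet>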

Modules of the backend application (and API)

Module name    (Main) tasks covered
APIs           API documentation
data           Representations of entity and edition files, taxonomy files and other platform materials
file           File system operations (query, get most recent, delete)[3]
index          Populating the search index
LOD            ID-based entity resolving, generation of exchange formats
sitemap        Sitemap files for platform contents (currently being refactored)

Frontend Application

The hallerNet frontend is a web application that behaves as a single-page application (SPA). As such, the app dynamically rewrites the current page based on data from the backend data API and additional endpoints.

It is built with Quasar, an open-source, Vue.js-based framework that provides state-of-the-art UI components.

Pages

The frontend has several views.

TEI to HTML transformation

The transformation of XML/TEI-encoded transcriptions to HTML, i.e. into a form that can be displayed in the web browser, is one of the frontend's core functions.

Roughly speaking, this works as follows (illustrated below):

  1. The backend delivers XML/TEI as a serialized string in JSON to the frontend.
  2. On the client side, the received XML string is parsed, transformed into Vue's HTML-based template syntax using DOM methods[4], and serialized again.
  3. The serialized template is integrated into Vue's reactivity system using dynamic components.
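
A hedged illustration of step 2 (both the TEI fragment and the resulting component name are invented for this example):

xml
  <!-- hypothetical TEI input fragment -->
  <p>Letter to <persName ref="per_01234">Bonnet</persName></p>

  <!-- hypothetical result in Vue template syntax: the TEI element has become
       a dynamic component that handles the entity link -->
  <p>Letter to <pers-name ref="per_01234">Bonnet</pers-name></p>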

Deployment

Auto-deployment is set up for the development environment, so that code changes are automatically transferred to a development server after each commit. There, the changes can be tested before they are published to production.[5]

Search Index

The chosen file system-based approach of the architecture necessitates a comprehensive search index that always reflects the current state of the data, in order to facilitate rapid queries. This is essential for the search and filter functionalities available in the frontend/web portal, but it is also very convenient for data editing and data processing tasks.

For this reason, the platform maintains an instance of Apache Solr, a mature and widely used open-source search engine that is equally well suited to XML data and workflows and to web development (JSON), and that offers a wide range of features and configuration options.

General approach and structure

The configuration of the index is maintained in a managed-schema file, which is edited manually and kept under revision control.[6]
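
A hedged excerpt of what field definitions in a managed-schema file look like (the field names and types below are invented, not the actual hallerNet fields):

xml
  <!-- hypothetical excerpt; the actual hallerNet fields differ -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="label" type="text_general" indexed="true" stored="true"/>
  <field name="entity_type" type="string" indexed="true" stored="true"/>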

The indexing of file contents and of the references between files is done via HTTP POST requests issued by the backend application. For the most part, these requests are initiated by the editing environment. In particular, saving and closing a file always triggers (re-)indexing of the file in question.
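
A hedged sketch of such an indexing request, using the EXPath HTTP Client mentioned in footnote 2 (the Solr URL, core name and document fields are assumptions):

xml
  <!-- hypothetical fragment, to be placed inside a template of an indexing pipeline
       stylesheet; assumes xmlns:http="http://expath.org/ns/http-client" is declared
       on the stylesheet; core name and fields are assumptions -->
  <xsl:variable name="solr-update" as="element(http:request)">
    <http:request method="post" href="http://localhost:8983/solr/hallernet/update?commit=true">
      <http:body media-type="application/json">[{"id": "per_01234", "label": "Albrecht von Haller"}]</http:body>
    </http:request>
  </xsl:variable>
  <xsl:sequence select="http:send-request($solr-update)"/>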

Interplay of backend application and search index

The backend application interacts with Solr in a bidirectional manner: on the one hand, data indexing is initiated during indexing pipeline runs; on the other hand, the index is consumed by other (non-indexing) pipeline runs for reasons of efficiency (index queries are vastly quicker than file system queries).
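
A hedged sketch of the consuming direction, reading a result count from the index within a pipeline stylesheet (the URL, core name and field name are assumptions):

xml
  <!-- hypothetical index lookup from a non-indexing pipeline run -->
  <xsl:variable name="person-count" select="
      json-doc('http://localhost:8983/solr/hallernet/select?q=entity_type:person&amp;rows=0')
      ?response?numFound"/>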

While this creates a circular dependency, data indexing plays a more fundamental role than index querying:

data creation               most fundamental
data indexing                      ⋀
index querying                     ⋁
data delivery               most dependent

Image Server

Since Q1 2024, all digitised images owned by hallerNet are part of the Haller collection on the IIIF server of the University Library of Bern (UB) and are integrated using the IIIF protocols. Refer to https://www.ub.unibe.ch/service/open_science/iiif_service_fuer_bilder for more information.

In earlier stages of development, images were served from a self-hosted image server (IIPImage as a Docker application on a DSM NAS, 2019-2023). hallerNet initiated the development of the IIIF hosting service by the University Library of Bern.

Authoring Environment

The centrepiece of hallerNet's authoring environment is a custom framework for Oxygen XML Editor. In conjunction with an authoring plugin, this framework offers a graphical interface that assists in the creation of (highly structured) entity files and source transcriptions. It takes care of ID assignment, offers convenient methods for supplying taxonomy-based information and for creating links to other files, provides quality and integrity checks, and synchronises XML encodings with the platform's search index.

The framework integrates type-specific data templates and input masks with custom operations and menus, some of which are based on Ediarum developments, as well as calls to the backend API.

Built-in functionality

When editing files using the authoring environment, a number of actions happen in the background. While this may require a little patience in some cases, the actions are always motivated by user benefits.

Creation of a new file (based on a type-specific template)

  • Querying the last assigned ID and assigning the next ID to the new file
  • Storing a skeleton of the file on the server
  • Indexing the ID and label in the search index; this makes the entry immediately findable in the frontend

Opening an existing file

  • Fetching taxonomy information for displaying the taxonomy dropdowns

Editing an existing file

  • File validation; for instance, a warning is issued if a newly assigned authority data identifier is already present in the system
  • Rendering of the author mode views; structural changes to the XML (addition/removal of elements/attributes) triggered via the buttons require re-rendering of the authoring view, and the editor may feel frozen for a short while

Closing a file

  • A full indexing run; this keeps the index and the frontend presentation always up to date
  • A data transformation that carries out operations depending on the object type (e.g. renumbering footnotes after changes to transcriptions, fetching geo-coordinates for places based on their GeoNames ID)
  • A search query over all entities of a given type and re-rendering of the dropdown lists; this enables immediate linking of newly created files

Using Oxygen's built-in framework and plugin mechanisms allows new versions to be rolled out to the whole team without the need for additional communication. When launching the program, users are notified of new versions, which, upon confirmation, are fetched from the server and installed locally.

Hosting and Deployment

All hallerNet services are currently run in a Virtuozzo cloud computing environment in Swiss data centres (operated by Hidora).

The only exceptions to this setup are a simple legacy file server (run as a GitHub page) and frontend pre-release instances (auto-deployed to Netlify infrastructure).

Roadmap and future developments

Concrete proposals

  • Introduction of globally unique persistent identifiers (GUPIs).
  • Extension of the existing ID system to allow for a greater number of objects.
  • RDF endpoints are being developed in exchange with the SwissBritNet SNSF research project.

Under consideration

  • More global unification of file serialisation, e.g. using Git hooks.

Specific information on implementation, configuration and development of certain components is being maintained directly in the software repositories.

  1. On the multiple advantages of this approach see https://wiki.c2.com/?PowerOfPlainText ↩︎

  2. HTTP requests are issued using the XSLWeb implementation of the EXPath HTTP Client Module. ↩︎

  3. These methods are called from transformation pipelines, but are also essential to functionalities of the editing environment. ↩︎

  4. Two JavaScript libraries from Sebastian Zimmer are used for this purpose: X.js facilitates the query and modification of XML fragments by providing element- and attribute-specific methods. TEI2Vue.js transforms XML into Vue's HTML-based template syntax with the help of X.js. ↩︎

  5. The Git repository for the frontend is linked to the cloud provider Netlify (Continuous Deployment, CD). The development version (main branch) is available at https://haller.netlify.app. ↩︎

  6. As changes to the schema are developed and tested locally and then deployed to the server, the generally preferred way of modifying the Solr schema via the API is not advantageous here. ↩︎

Released under ISC License.