System Architecture
Christian Forney, Peter Dängeli
General Overview
hallerNet is an XML-based publication environment for historical research data. Its data consists of structural data and textual data (transcriptions, historical accounts) encoded in accordance with the TEI guidelines. Building on a common file system, the hallerNet environment is well-suited for different editing workflows. Schema-based data validation and version control allow a high degree of control over the research data at any stage.
The data is presented on the web platform by a frontend application implemented with a common JavaScript framework. It draws on two interacting services:
- a backend application consisting of XSLT transformation pipelines and
- an efficient search index.
A platform-specific editing environment in the form of a framework and plugin for the oXygen XML Editor is used to facilitate data authoring. Essentially, the XML code is rendered using CSS rules to offer a user-friendly editing interface. The environment integrates data lookup queries for entity linking and makes use of specific backend endpoints, for instance for retrieving the next available ID or updating the index.
All data and software of hallerNet are released under free licenses.
Governance and contact
The hallerNet platform is operated by the Historical Institute of the University of Bern. Please refer to https://hallernet.org/about for more information.
Contact: contact@hallernet.org
About this Documentation
The goal of this documentation is to provide an overview of the data model and the software that constitute the technical basis of the hallerNet platform.
Core Principles
Architectural decisions were influenced by the following aspects:
- research data as a first-class citizen
- modularity
- conformance to standards (technical standards and community standards)
- flexible use of XML tooling and languages
Data
Research Data as a First-Class Citizen
A strong motivation for the chosen architecture emerged from the history of Haller research in Bern and its use and re-use of born-digital research data over many years. The value of the data and of its description quickly became clear when porting the data from a (relatively) closed database to a newly created open environment. In order to keep the data at least equally accessible and extensible in the future, it seemed sensible to avoid tool and system dependencies on the level of the data as much as possible. Consequently, the research data was put at the centre of the infrastructure, and the tools and workflows were built around it.
In more operational terms, all data is stored in plain text formats on a conventional file system[1] and subject to revision control (Git). In order to restrict version control to significant changes and to ignore other changes (e.g. with regard to insignificant whitespace), (server-side) file storage and transformation runs make use of unified file serialisation.
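One conceivable building block of such unified serialisation is an identity transform with fixed serialisation settings, through which files are passed before they are stored. The following is a minimal sketch, not the actual hallerNet serialisation rules:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch of a normalising identity transform (XSLT 3.0);
     the actual hallerNet serialisation rules may differ. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- fixed serialisation settings: stable encoding, no added indentation -->
  <xsl:output method="xml" encoding="UTF-8" indent="no"/>
  <!-- copy everything unchanged -->
  <xsl:mode on-no-match="shallow-copy"/>
</xsl:stylesheet>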
Data model / TEI
The data model of the platform consists of two major parts:
- structural data, i.e. metadata on the entities
- the extensible hallerNet edition model detailing the transcription methods
These parts are described in more detail in the data model documentation and the edition model documentation.
Formally, the model is specified in a number of Schematron schemas and a more global ODD description.
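To give a flavour of the schema layer, a Schematron rule of the following kind could enforce referential conventions. Both the rule and the ID pattern are hypothetical; the actual constraints are defined in the hallerNet schemas:

<!-- Hypothetical Schematron rule; the actual constraints live in the
     hallerNet schemas and the ID pattern shown here is an assumption. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <sch:rule context="tei:persName[@ref]">
      <sch:assert test="matches(@ref, '^person_\d+$')">
        A persName reference must point to a hallerNet person ID.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>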
FAIR data repository
The research data is periodically published on the Zenodo FAIR data repository in its entirety under a free license (CC BY 4.0).
Backend Application
The hallerNet backend application serves the platform data in formats consumable by the frontend application and external services (such as CorrespSearch).
Conceptually, a series of data conversions is at play, orchestrated according to the structure of the URL request. XSLT 3.0 was chosen as the predominant application programming language for its suitability for processing XML files (in a templating fashion based on pattern matching) as well as for its versatility.
The implementation of the backend application is based on the open-source web application framework XSLWeb, an accessible and pragmatic successor of pipeline-based frameworks such as Apache Cocoon. It is a server-based application (written in Java) that acts as the orchestrating entity, parsing client requests and invoking conversion pipelines.
General approach and structure
A modularised request-dispatcher stylesheet acts as the backbone of the conversion pipelines. Depending on the structure of the client request, data is selected, combined, enriched, and finally transformed to an output format (XML, JSON, Markdown) or posted to other services (such as the search index).[2]
Illustration of the principle
A simple example may illustrate how the application works:
The request /api/data/platform/about/xml issued by the frontend application is first processed by the main request dispatcher file (request-dispatcher.xsl), which contains just a handful of template rules, none of which matches the pattern of this request. Besides the locally defined rules, six modules are included:
<xsl:include href="request-dispatchers/req-disp-apis.xsl"/>
<xsl:include href="request-dispatchers/req-disp-data.xsl"/>
<xsl:include href="request-dispatchers/req-disp-file.xsl"/>
<xsl:include href="request-dispatchers/req-disp-index.xsl"/>
<xsl:include href="request-dispatchers/req-disp-lod.xsl"/>
<xsl:include href="request-dispatchers/req-disp-sitemap.xsl"/>
The request is then evaluated against the templates in these modules, and a match is found in the data module:
<xsl:template name="api-data-platform-about-xml" match="/req:request[matches(req:path, '^/api/data/platform/about/xml/?$')]">
<pipeline:pipeline>
<pipeline:transformer name="pipeline-step1" xsl-path="lib/api/data/about-xml.xsl" log="true"/>
</pipeline:pipeline>
</xsl:template>
As the structure of the request (req:request) matches the given regular expression, a pipeline is invoked. Specifically, the transformation process defined in about-xml.xsl is launched. As a general rule, the path where a stylesheet is located (here lib/api/data/) reflects the module name.
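The invoked stylesheet then builds the actual response. In schematic form, a pipeline step such as about-xml.xsl could look as follows; this is a sketch only, and the data path and the concrete response structure (following the XSLWeb conventions) are assumptions:

<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:req="http://www.armatiek.com/xslweb/request"
    xmlns:resp="http://www.armatiek.com/xslweb/response">

  <!-- a pipeline step receives the request document as its input -->
  <xsl:template match="/req:request">
    <resp:response status="200">
      <resp:body>
        <!-- assumed location of the "about" document -->
        <xsl:copy-of select="doc('../../data/platform/about.xml')"/>
      </resp:body>
    </resp:response>
  </xsl:template>

</xsl:stylesheet>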
Modules of the backend application (and API)
| Module name | (Main) tasks covered |
| --- | --- |
| APIs | API documentation |
| data | Representations of entity and edition files, taxonomy files and other platform materials |
| file | File system operations (query, get most recent, delete)[3] |
| index | Populating the search index |
| LOD | ID-based entity resolving, generation of exchange formats |
| sitemap | Sitemap files for platform contents (currently being refactored) |
Frontend Application
The hallerNet frontend is a web application behaving as a single-page application (SPA). As such, the app dynamically rewrites the current page based on data from the backend data API and additional endpoints.
It is built with Quasar, an open-source and Vue.js-based framework which provides state-of-the-art UI components.
Pages
The frontend has several views:
- About (Example)
- Actors (Example)
- Data Entity (Example)
- Data Register (Example)
- Edition (Example)
- Edition Entity (Example)
- Landing (Example)
TEI to HTML transformation
The transformation of XML/TEI-encoded transcriptions to HTML, i.e. into a form that can be displayed in the web browser, is one of the frontend's core functions.
Roughly speaking, this works as follows:
- The backend delivers the XML/TEI as a serialized string within a JSON payload to the frontend.
- On the client side, the received XML string is parsed, transformed into Vue's HTML-based template syntax using DOM methods[4] and serialized again.
- The serialized template is integrated into Vue's reactivity system using dynamic components, as sketched below.
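Schematically, a TEI fragment is thus turned into a component-based template. The component name and the ID below are purely illustrative; the actual mapping is defined by the frontend:

<!-- TEI fragment as delivered by the backend (inside a JSON payload) -->
<p>Brief an <persName ref="person_00123">Bonnet</persName></p>

<!-- possible result in Vue template syntax; the custom element name and
     the ID are assumptions for illustration only -->
<p>Brief an <tei-pers-name ref="person_00123">Bonnet</tei-pers-name></p>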
Deployment
Auto-deployment is set up for the development environment so that code changes are automatically transferred to a development server after each commit. Here, the changes can be tested before they are published in production.[5]
Search Index
The file system-based approach of the architecture necessitates an encompassing search index that always reflects the current state of the data in order to facilitate rapid queries. This is essential for the search and filter functionalities available in the frontend/web portal, but it is also very convenient for data editing and data processing tasks.
For this reason, the platform maintains an instance of Apache Solr, a mature and widely used open-source search engine that is well suited both to XML data and workflows and to web development (JSON), and that offers a wide range of features and configuration options.
General approach and structure
The configuration of the index is maintained in a managed-schema file, which is edited manually and kept under revision control.[6]
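Field definitions in the managed-schema follow the usual Solr syntax. An illustrative excerpt might read as follows; the field names and types are assumptions, not the actual hallerNet schema:

<!-- illustrative managed-schema excerpt; field names and types assumed -->
<field name="id"    type="string"       indexed="true" stored="true" required="true"/>
<field name="label" type="text_general" indexed="true" stored="true"/>
<field name="type"  type="string"       indexed="true" stored="true"/>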
The indexing of file contents and of the references between files is done via HTTP POST requests issued by the backend application. For the most part, these requests are initiated by the editing environment. In particular, saving and closing a file always triggers (re-)indexing of the file in question.
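Such an indexing call can be pictured as follows. This is a sketch using the EXPath HTTP client available in XSLWeb; the Solr URL, the core name and the field names are assumptions:

<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:http="http://expath.org/ns/http-client">

  <!-- posts a minimal document (ID and label) to the Solr update handler -->
  <xsl:template name="index-file">
    <xsl:param name="id" as="xs:string"/>
    <xsl:param name="label" as="xs:string"/>
    <xsl:variable name="request" as="element(http:request)">
      <http:request method="post"
          href="http://localhost:8983/solr/hallernet/update?commit=true">
        <http:body media-type="application/json"/>
      </http:request>
    </xsl:variable>
    <!-- the JSON payload is passed separately, as the http:body is empty -->
    <xsl:sequence select="http:send-request($request, (),
        serialize([map{'id': $id, 'label': $label}], map{'method': 'json'}))"/>
  </xsl:template>

</xsl:stylesheet>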
Interplay of backend application and search index
The backend application interacts with Solr in a bidirectional manner: on the one hand, data indexing is initiated during indexing pipeline runs; on the other hand, the index is consumed by other (non-indexing) pipeline runs for reasons of efficiency (index queries are vastly quicker than file system queries).
Although this creates a circular dependency, data indexing plays a more fundamental role than index querying; the layers build on each other from the most fundamental to the most dependent:
- data creation (most fundamental)
- data indexing
- index querying
- data delivery (most dependent)
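A non-indexing pipeline can, for example, resolve many references with a single index query instead of touching the file system. Again a sketch; the URL, the core name and the query parameters are assumptions:

<!-- query the index instead of the file system; doc() parses the XML response -->
<xsl:variable name="hits"
    select="doc('http://localhost:8983/solr/hallernet/select?q=type:person'
            || '&amp;fl=id,label&amp;rows=1000&amp;wt=xml')"/>
<!-- Solr's XML response writer wraps the documents in result/doc elements -->
<xsl:for-each select="$hits//result/doc">
  <xsl:value-of select="str[@name = 'label']"/>
</xsl:for-each>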
Image Server
Since 2024/Q1, all digitised images owned by hallerNet have been part of the Haller collection on the UB IIIF server, integrated via the IIIF protocols. Refer to https://www.ub.unibe.ch/service/open_science/iiif_service_fuer_bilder for more information.
In earlier stages of development, images were served on a self-hosted image server (IIPImage as a Docker application on a DSM NAS, 2019-2023). hallerNet initiated the development of the IIIF hosting service by the University Library of Bern.
Authoring Environment
The centerpiece of the hallerNet authoring environment is a custom framework for the oXygen XML Editor. In conjunction with an authoring plugin, this framework offers a graphical interface that assists the creation of (highly structured) entity files and source transcriptions. It takes care of ID assignment, offers convenient methods to supply taxonomy-based information and to create links to other files, provides quality and integrity checks, and synchronises XML encodings with the search index of the platform.
The framework integrates type-specific data templates and input masks with custom operations and menus, some of which are based on Ediarum developments, as well as calls to the backend API.
Built-in functionality
When editing files in the authoring environment, a number of actions happen in the background. While this may require a little patience in some cases, these actions always serve the user's benefit.
Creation of a new file (based on a type-specific template)
- Querying the last assigned ID and assigning the next ID to the new file (see the first sketch below)
- Storing a skeleton of the file on the server
- Indexing the ID and label in the search index -- this allows the entry to be found in the frontend immediately
Opening an existing file
- Fetching taxonomy information for displaying the taxonomy dropdowns
Editing an existing file
- File validation; for instance, a warning is issued if a newly assigned authority data identifier is already present in the system
- Rendering of the author mode views; triggering structural changes to the XML (addition/removal of elements/attributes) via the buttons requires re-rendering of the authoring view, so the editor may feel frozen for a short while
Closing a file
- Full indexing run; this keeps the index and the frontend presentation up to date at all times
- Data transformation on closing, carrying out operations depending on the object type (e.g. renumbering footnotes after changes to transcriptions, or fetching geo coordinates for places based on the GeoNames ID; see the second sketch below)
Dropdown-based entity linking
- A search query over all entities of a given type and rendering of the dropdown list; this enables immediate linking even of newly created files
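Two of these background actions may serve as illustrations. First, the retrieval of the next available ID can be pictured as a small computation over a Solr query result; the 'person_' ID scheme, the zero padding and the field handling are assumptions:

<!-- $solr-result holds the parsed XML response of a query for all person
     IDs (Solr XML response writer: result/doc/str elements) -->
<xsl:variable name="next-nr" as="xs:integer"
    select="1 + max($solr-result//result/doc/str[@name = 'id']
            ! xs:integer(substring-after(., 'person_')))"/>
<!-- e.g. person_00124 if person_00123 was the highest ID so far -->
<xsl:variable name="next-id" as="xs:string"
    select="'person_' || format-integer($next-nr, '00000')"/>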
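Second, the geo coordinate lookup on closing a place file can be pictured as a call to the GeoNames web service; the username parameter and the surrounding wiring are assumptions:

<!-- fetch the GeoNames record for the stored ID and extract the coordinates -->
<xsl:variable name="geo"
    select="doc('http://api.geonames.org/get?geonameId=' || $geonames-id
            || '&amp;username=hallernet')"/>
<xsl:variable name="coords" as="xs:string"
    select="string-join(($geo/geoname/lat, $geo/geoname/lng), ' ')"/>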
Using oXygen's built-in framework and plugin mechanisms makes it possible to roll out new versions to the whole team without the need for additional communication. When launching the program, users are notified of new versions, which, upon confirmation, are fetched from the server and installed locally.
Hosting and Deployment
All hallerNet services are currently run in a Virtuozzo cloud computing environment in Swiss data centres (operated by Hidora).
The only exceptions to this setup are a simple legacy file server (run as a GitHub Pages site) and frontend pre-release instances (auto-deployed to Netlify infrastructure).
Roadmap and future developments
Concrete proposals
- Introduction of globally unique persistent identifiers (GUPIs).
- Extension of the existing ID system to allow for a greater number of objects.
- Development of RDF endpoints in exchange with the SwissBritNet SNSF research project.
Under consideration
- More global unification of file serialisation, e.g. using Git hooks.
Specific information on the implementation, configuration and development of certain components is maintained directly in the software repositories.
Further reading
- Ute Recker-Hamm, Martin Stuber: Haller Online – Konzept für den Umbau, Ausbau und die langfristige Sicherung der Haller/OeG Datenbank [8.6.2015]. Some aspects of the concept were altered during the implementation phase (e.g. XML database vs. file system).
- Peter Dängeli, Christian Forney, Martin Stuber: Vom Stellenkommentar zum Netzwerk und zurück: große Quellenkorpora und tief erschlossene Strukturdaten auf hallerNet. DHd 2019, Frankfurt, 29 March 2019.
- Peter Dängeli, Christian Forney: Referencing annotations as a core concept of the hallerNet edition and research platform. TEI 2019, Graz, 18 September 2019.
- Antonio Rojas Castro (2019): Modeling FRBR entities and their relationships with TEI. TEI 2019, Graz, 18 September 2019. DOI: 10.5281/zenodo.3446218
- Peter Dängeli, Martin Stuber (2020): Nachhaltigkeit in langjährigen Erschliessungsprojekten. FAIR-Data-Kriterien bei Editions- und Forschungsplattformen zum 18. Jahrhundert. xviii.ch: Jahrbuch der Schweizerischen Gesellschaft zur Erforschung des 18. Jahrhunderts, 11, pp. 34-51. Schwabe. DOI: 10.24894/2673-4419.00004
- Christian Forney, Martin Stuber (2024): Connecting floras and herbaria before 1850 – challenges and lessons learned in digital history of biodiversity. Digital History Switzerland 2024 (DigiHistCH24), University of Basel. DOI: 10.5281/zenodo.13768615
[1] On the multiple advantages of this approach, see https://wiki.c2.com/?PowerOfPlainText
[2] HTTP requests are issued using the XSLWeb implementation of the EXPath HTTP Client Module.
[3] These methods are called from transformation pipelines, but are also essential to functionalities of the editing environment.
[4] Two JavaScript libraries by Sebastian Zimmer are used for this purpose: X.js facilitates the querying and modification of XML fragments by providing element- and attribute-specific methods. TEI2Vue.js transforms XML into Vue's HTML-based template syntax with the help of X.js.
[5] The Git repository for the frontend is linked to the cloud provider Netlify (continuous deployment, CD). The development version (main branch) is available at https://haller.netlify.app.
[6] As changes to the schema are developed and tested locally and then deployed to the server, the generally preferred way of modifying the Solr schema via its API is not advantageous here.