Format
CS4BioDiversity is a workshop at the INFORMATIK 2021 conference.
Background
Preserving biodiversity is a necessary prerequisite for achieving (almost) all of the seventeen Sustainable Development Goals defined by the United Nations. A better understanding of biodiversity, in turn, is a prerequisite for its preservation. Computer science can make essential contributions to this improved understanding. The workshop CS4BioDiversity aims to showcase such contributions and to stimulate computer science research for the conservation of biodiversity, ranging from automatic data acquisition through modern methods for data analysis and visualization to improvements of the research data infrastructure (especially in the context of efforts to establish a national research data infrastructure).
Example subject areas include:
- Support for the recording of biodiversity (e.g. sensor networks, automatic evaluation of camera trap images, eDNA)
- Biodiversity data management (e.g. efficient data collection and data management including ontology management, data integration, knowledge graphs, time series, semantic web and Wikidata, automatic annotation of data, citizen science, processing and documentation of data for long-term archiving)
- Analysis of biodiversity data (e.g. deep learning, app development for diagnostics and identification, visualization, workflow systems)
- Simulations for biodiversity research (e.g. modeling software, reproducibility of simulations, provenance)
- Support for the software-based digital long-term archiving of biodiversity data (e.g. software for the digital long-term archiving of biodiversity data, automated and standardized interfaces, AI-supported grouping of file formats in the context of preservation, partially automated derivation of significant properties, restoration or safeguarding of the interpretability of information stored in legacy data formats, tools for transformation between community metadata schemas and standards)
The workshop will combine invited talks, research talks and a Software Marketplace with demos.
Call for Papers and Demos
We invite the submission of short and full papers and of proposals for demos at the Software Marketplace related to the topics above or other topics in the area of computer science for biodiversity. Full papers must not exceed 18 pages; short papers must not exceed 6 pages. Both full and short papers need to be formatted according to the LNI formatting guidelines (https://gi.de/service/publikationen/lni) and submitted via EasyChair:
https://easychair.org/conferences/?conf=cs4biodiversity
Software descriptions can be submitted either as papers (to be included in the proceedings) or as abstracts (to be published in the program leaflet only).
Submissions are welcome in English or German. All submissions will be reviewed by members of the program committee. Submissions may be directly accepted, asked for a revision (resulting in another round of reviews), or rejected.
Software Marketplace
Software Marketplace at the Computer Science for Biodiversity workshop at INFORMATIK 2021 – annual meeting of the Gesellschaft für Informatik e.V. on September 27, 2021
Have you developed a software solution with potential for biodiversity research, or are you in the process of implementing such a tool?
Do you want to present your data management platform, visualization tool, citizen science app, or another solution to the research community?
Then submit a brief description of your software solution and of how you want to demonstrate it (live demo, video, or poster) by July 15. In addition to the presentation format you choose, you can introduce your software to workshop attendees in a one-minute “teaser” and talk to them directly at a “market stall”.
The summary should not exceed one page (max. 2000 characters with spaces).
The link for the submission is:
https://easychair.org/conferences/?conf=cs4biodiversity
Submission Deadline: July 15, 2021
Important Dates
17.05.2021 Paper Submission Deadline
09.06.2021 Notification
18.06.2021 Deadline for Revisions
25.06.2021 Notification for Revisions
30.06.2021 Submission deadline for camera-ready papers (GI-Edition: Lecture Notes in Informatics, LNI)
15.07.2021 Submission of Abstracts for the Software Marketplace
22.07.2021 Notification for submissions to the Marketplace
30.07.2021 Submission deadline for camera-ready abstracts
27.09.2021 Workshop
Program
Monday, September 27, 2021
10:00 – 10:05 Welcome
10:05 – 10:45 Invited Talk: Frank Oliver Glöckner: NFDI4Biodiversity
10:45 – 12:15 Paper Session 1 (15 min talk + 5 min discussion per paper)
Jonas Höchst, Jannis Gottwald, Patrick Lampe, Julian Zobel, Thomas Nauss, Ralf Steinmetz and Bernd Freisleben: "tRackIT OS: Open-source Software for Reliable VHF Wildlife Tracking"
Dimitri Korsch, Paul Bodesheim and Joachim Denzler: "Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification"
Julia Böhlke, Dimitri Korsch, Paul Bodesheim and Joachim Denzler: "Exploiting Web Images for Moth Species Classification"
Dagmar Triebel, Ariane Grunz, Stefan Seifert, Anton Link and Gerhard Rambold: "DiversityNaviKey, a Progressive Web Application for interactive diagnosis and identification"
12:15 – 13:00 Lunch break (on your own)
13:00 – 14:30 Paper Session 2 (15 min talk + 5 min discussion per paper)
Julian Zobel, Paul Frommelt, Patrick Lieser, Jonas Höchst, Patrick Lampe, Bernd Freisleben and Ralf Steinmetz: Energy-efficient Mobile Sensor Data Offloading via WiFi using LoRa-based Connectivity Estimations
Matthias Körschens, Paul Bodesheim, Christine Römermann, Solveig Franziska Bucher, Mirco Migliavacca, Josephine Ulrich and Joachim Denzler: Automatic Plant Cover Estimation with Convolutional Neural Networks
Daphne Auer, Paul Bodesheim, Christian Fiderer, Marco Heurich and Joachim Denzler: Minimizing the Annotation Effort for Detecting Wildlife in Camera Trap Images with Active Learning
Dina Sharafeldeen, Mohamed Bakli, Alsayed Algergawy and Birgitta König-Ries: ISTMINER: Interactive Spatiotemporal Co-occurrence Pattern Extraction: A Biodiversity case study
14:30 – 14:45 Coffee break (on your own)
14:45 – 15:30 Short Paper Session (10 min talk + 5 min discussion per paper)
Clemens-Alexander Brust, Björn Barz and Joachim Denzler: "Carpe Diem: A Lifelong Learning Tool for Automated Wildlife Surveillance"
Andreas Kohlbecker, Anton Güntsch, Norbert Kilian, Wolf-Henning Kusber, Katja Luther, Andreas Müller, Eckhard von Raab-Straube and Walter Berendsohn: "A pragmatic approach to concept-based annotation of scientific names in biodiversity and environmental research data"
Samira Babalou, David Schellenberger Costa, Jens Kattge, Christine Römermann and Birgitta König-Ries: "Towards a Semantic Toolbox for Reproducible Knowledge Graph Generation in the Biodiversity Domain – How to Make the Most out of Biodiversity Data"
15:30 – 15:45 One-Minute Flash Talks: Software Marketplace
15:45 – 17:15 Software Marketplace
Frank Broda, Frank Lange, Fabian Mauz and Ludger Wessjohann: "The Cloud Resource & Information Management System (CRIMSy): Supporting collaboration and data management in the life sciences"
Christian Beilschmidt, Johannes Drönner, Michael Mattig and Bernhard Seeger: "Geo Engine: A flexible Dashboard for Exploring and Analyzing Biodiversity Data"
Luise Quoß, Henrique Pereira, Néstor Fernandez, José Valdez and Christian Langer: "ebvnetcdf R package"
Christian Langer, Néstor Fernández, Jose Valdez, Luise Quoss and Henrique Pereira: "Cataloging Essential Biodiversity Variables with the EBV Data Portal"
Alsayed Algergawy, Hamdi Hamed and Birgitta König-Ries: "JeDaSS: A Tool for Dataset Summarization and Synthesis"
Felicitas Löffler, Fateme Shafiei, Sven Thiel, Kobkaew Opasjumruskit and Birgitta König-Ries: "[Dai:Si] – Modular dataset retrieval for biological data"
Björn Quast, Christian Bräunig and Peter Grobe: "Bridging the GAP in Metabarcoding Research: A shared repository for ASV tables"
Jitendra Gaikwad, Roman Gerlach, David Schöne, Sven Thiel, Franziska Zander and Birgitta König-Ries: "BEXIS2: A data management platform for mid-to-large scale research projects to facilitate making biodiversity research data Findable, Accessible, Interoperable and Reusable"
Steffen Ehrmann: "geometr – Generate and modify interoperable geometric shapes"
Till-Hendrik Macher, Arne J. Beermann and Florian Leese: "TaxonTableTools – A comprehensive, platform-independent graphical user interface software to explore and visualise DNA metabarcoding data"
Ariane Grunz, Dagmar Triebel, Stefan Seifert, Anton Link and Gerhard Rambold: "Live Demo of DiversityNaviKey, a Progressive Web Application for interactive diagnosis and identification"
Abstracts
Jonas Höchst, Jannis Gottwald, Patrick Lampe, Julian Zobel, Thomas Nauss, Ralf Steinmetz and Bernd Freisleben: tRackIT OS: Open-source Software for Reliable VHF Wildlife Tracking
Abstract:
We present tRackIT OS, open-source software for reliable VHF radio tracking of (small) animals in their wildlife habitat. tRackIT OS is an operating system distribution for tRackIT stations that receive signals emitted by VHF tags mounted on animals and are built from low-cost commodity-off-the-shelf hardware. tRackIT OS provides software components for VHF signal processing, system monitoring, configuration management, and user access. In particular, it records, stores, analyzes, and transmits detected VHF signals and their descriptive features, e.g., to calculate bearings of signals emitted by VHF radio tags mounted on animals or to perform animal activity classification. Furthermore, we provide results of an experimental evaluation carried out in the Marburg Open Forest, the research and teaching forest of the University of Marburg, Germany. All components of tRackIT OS are available under a GNU GPL 3.0 open source license at https://github.com/nature40/tRackIT-OS.
Dimitri Korsch, Paul Bodesheim and Joachim Denzler: Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification
Abstract:
Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are rarely applied so far, and experts evaluate the generated data masses manually. Especially the support of deep learning methods for visual monitoring is not yet established in biodiversity research, compared to other areas like advertising or entertainment. In this paper, we present a deep learning pipeline for analyzing images captured by a moth scanner, an automated visual monitoring system of moth species developed within the AMMOD project. We first localize individuals with a moth detector and afterward determine the species of detected insects with a classifier. Our detector achieves up to 99.01% mean average precision and our classifier distinguishes 200 moth species with an accuracy of 93.13% on image cutouts depicting single insects. Combining both in our pipeline improves the accuracy for species identification in images of the moth scanner from 79.62% to 88.05%.
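The two-stage design described above lends itself to a compact illustration. The following is a minimal sketch of such a localize-then-classify pipeline in Python, assuming hypothetical `detector` and `classifier` objects and a NumPy image array; it illustrates the general approach, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    box: Tuple[int, int, int, int]  # (x, y, width, height) of one insect
    species: str = ""
    confidence: float = 0.0

def run_pipeline(image, detector, classifier) -> List[Detection]:
    """Stage 1: localize individual insects; stage 2: classify each cutout.

    `detector.predict` and `classifier.predict` are hypothetical stand-ins
    for the trained moth detector and species classifier of the paper.
    """
    detections = []
    for x, y, w, h in detector.predict(image):       # stage 1: localization
        cutout = image[y:y + h, x:x + w]             # crop a single insect
        species, conf = classifier.predict(cutout)   # stage 2: classification
        detections.append(Detection((x, y, w, h), species, conf))
    return detections
```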
Julia Böhlke, Dimitri Korsch, Paul Bodesheim and Joachim Denzler: Exploiting Web Images for Moth Species Classification
Abstract:
Due to shrinking habitats, moth populations are declining rapidly. An automated moth population monitoring tool is needed to support conservationists in making informed decisions for counteracting this trend. A non-invasive tool would involve the automatic classification of images of moths, a fine-grained recognition problem. Currently, the lack of images annotated by experts is the main hindrance to such a classification model. To understand how to achieve acceptable predictive accuracies, we investigate the effect of differently sized datasets and data acquired from the Internet. We find the use of web data immensely beneficial and observe that few images from the evaluation domain are enough to mitigate the domain shift in web data. Our experiments show that counteracting the domain shift may yield a relative reduction of the error rate of over 60%. Lastly, the effect of label noise in web data and proposed filtering techniques are analyzed and evaluated.
Dagmar Triebel, Ariane Grunz, Stefan Seifert, Anton Link and Gerhard Rambold: DiversityNaviKey, a Progressive Web Application for interactive diagnosis and identification
Abstract:
DiversityNaviKey is designed as a diagnostic tool in the field of biology and related sciences to identify organisms as well as to interactively select other entities and objects related to research, based on a set of predefined properties. It allows queries on structured sources of descriptive data (trait data) to diagnose groups of objects based on combinations of optionally modified descriptor states or values that are selected consecutively during the diagnostic process. The Web App is implemented as a single-page application and contains the entire presentation logic to dynamically change a pre-generated HTML page in the browser. The content data is accessed via a web service as JSON packages. DiversityNaviKey is a progressive web application that uses caching mechanisms of browsers, such as Service Worker and IndexedDB. Thus, the main tasks are also available in offline mode. The current setup uses the SNSB technical infrastructure and data pipelines with a PostgreSQL cache database and the data management tool DiversityDescriptions as backend. The exemplary data sources are from various domains; two of them are large datasets from the long-term projects DEEMY and LIAS.
Julian Zobel, Paul Frommelt, Patrick Lieser, Jonas Höchst, Patrick Lampe, Bernd Freisleben and Ralf Steinmetz: Energy-efficient Mobile Sensor Data Offloading via WiFi using LoRa-based Connectivity Estimations
Abstract:
Animal monitoring in natural habitats provides significant insights into the animals’ behavior, interactions, health, or external influences. However, the size of monitoring devices attachable to animals strongly depends on the animals’ size, and thus the range of possible sensors, including batteries, is severely limited. Gathered data can be offloaded from monitoring devices to data sinks in a wireless sensor network using available radio access technologies, but this process also needs to be as energy-efficient as possible. This paper presents an approach to combine the benefits of high-throughput WiFi and robust low-power LoRa communication for energy-efficient data offloading. WiFi is only used when connectivity between mobile devices and data sinks is available, which is determined by LoRa-based distance estimations without the need for additional GPS sensors. A prototypical implementation on low-end commodity-off-the-shelf hardware is used to evaluate the proposed approach in a German mixed forest using a simple path loss model for distance estimation. The system provides an offloading success rate of 87%, which is similar to that of a GPS-based approach, but with around 37% less power consumption.
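The distance estimation can be illustrated with the standard log-distance path loss model, the kind of simple model the abstract mentions. In the sketch below, the reference RSSI, path loss exponent, and WiFi range are illustrative assumptions, not the calibrated values from the paper.

```python
def estimate_distance(rssi_dbm: float,
                      rssi_ref_dbm: float = -40.0,      # assumed RSSI at 1 m
                      path_loss_exponent: float = 3.0,  # assumed forest value
                      d0_m: float = 1.0) -> float:
    """Invert the log-distance path loss model
    RSSI = RSSI_ref - 10 * n * log10(d / d0), i.e.
    d = d0 * 10 ** ((RSSI_ref - RSSI) / (10 * n))."""
    return d0_m * 10 ** ((rssi_ref_dbm - rssi_dbm) / (10 * path_loss_exponent))

def wifi_worth_enabling(rssi_dbm: float, wifi_range_m: float = 50.0) -> bool:
    """Enable the power-hungry WiFi radio only when the LoRa-estimated
    distance to the data sink is within the assumed WiFi coverage radius."""
    return estimate_distance(rssi_dbm) <= wifi_range_m

# With these parameters, an RSSI of -70 dBm maps to 10 m, so WiFi is enabled.
print(wifi_worth_enabling(-70.0))  # True
```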
Matthias Körschens, Paul Bodesheim, Christine Römermann, Solveig Franziska Bucher, Mirco Migliavacca, Josephine Ulrich and Joachim Denzler: Automatic Plant Cover Estimation with Convolutional Neural Networks
Abstract:
Monitoring the responses of plants to environmental changes is essential for plant biodiversity research. This, however, is currently still being done manually by botanists in the field. This work is very laborious, and the data obtained is, though following a standardized method to estimate plant coverage, usually subjective and has a coarse temporal resolution. To remedy these caveats, we investigate approaches using convolutional neural networks (CNNs) to automatically extract the relevant data from images, focusing on plant community composition and species coverages of 9 herbaceous plant species. To this end, we investigate several standard CNN architectures and different pretraining methods. We find that we outperform our previous approach at higher image resolutions using a custom CNN with a mean absolute error of 5.16%. In addition to these investigations, we also conduct an error analysis based on the temporal aspect of the plant cover images. This analysis gives insight into where problems for automatic approaches lie, like occlusion and likely misclassifications caused by temporal changes.
Daphne Auer, Paul Bodesheim, Christian Fiderer, Marco Heurich and Joachim Denzler: Minimizing the Annotation Effort for Detecting Wildlife in Camera Trap Images with Active Learning
Abstract:
Analyzing camera trap images is a challenging task due to complex scene structures at different locations, heavy occlusions, and varying sizes of animals. One particular problem is the large fraction of empty images in recorded datasets because the motion detector often gets triggered by signals other than animal movements. To identify these empty images automatically, we use an active learning approach to train binary classifiers with small amounts of labeled data in order to keep the annotation effort of humans minimal. We particularly focus on distinct models for daytime and nighttime images and follow a region-based approach by training classifiers for single sites or small sets of camera stations. We evaluate our approach using camera trap images from the Bavarian Forest National Park and achieve comparable or even superior performance to publicly available detectors trained with millions of labeled images while requiring significantly smaller amounts of annotated training images.
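The general active learning technique can be sketched as an uncertainty-sampling loop. The snippet below uses scikit-learn's logistic regression over precomputed image features as a stand-in for the binary (animal vs. empty) classifiers in the paper; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(features, labels_oracle, n_init=20, n_query=10, rounds=5):
    """Uncertainty sampling: start from a small random labeled set and
    repeatedly ask the human annotator only for the most ambiguous images.
    `labels_oracle` plays the role of the human (1 = animal, 0 = empty);
    the random initial sample is assumed to contain both classes."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(features), size=n_init, replace=False))
    unlabeled = [i for i in range(len(features)) if i not in labeled]
    clf = LogisticRegression(max_iter=1000)

    for _ in range(rounds):
        clf.fit(features[labeled], labels_oracle[labeled])
        proba = clf.predict_proba(features[unlabeled])[:, 1]
        # A probability near 0.5 means the model is maximally uncertain.
        most_uncertain = np.argsort(-np.abs(proba - 0.5))[-n_query:]
        query = [unlabeled[i] for i in most_uncertain]
        labeled += query                                  # human labels these
        unlabeled = [i for i in unlabeled if i not in query]
    return clf
```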
Dina Sharafeldeen, Mohamed Bakli, Alsayed Algergawy and Birgitta König-Ries: ISTMINER: Interactive Spatiotemporal Co-occurrence Pattern Extraction: A Biodiversity case study
Abstract:
In recent years, the exponential growth of spatiotemporal data has led to an increasing need for new interactive methods for accessing and analyzing this data. In the biodiversity domain, species co-occurrence models are critical to gain a mechanistic understanding of the processes underlying biodiversity and supporting its maintenance. This paper introduces a new framework that allows users to explore species occurrence datasets at different spatial and temporal scales to extract co-occurrence patterns. As a real-world case study, we conducted several experiments on a subset of the Global Biodiversity Information Facility (GBIF) occurrence dataset to extract species co-occurrence patterns interactively. For better understanding, these co-occurrence patterns are visualized in a map view and as a graph. The user can also export these patterns in CSV format for further use. For many queries, runtimes are already in a range that allows for interactive use. Further optimizations are on our research agenda.
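The core idea of pairwise spatiotemporal co-occurrence extraction can be sketched compactly: bin GBIF-style occurrence records into coarse space-time cells and count, per cell, which species pairs appear together. This is a simplified illustration with invented parameters and data, not ISTMINER's actual algorithm.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_patterns(occurrences, cell_deg=1.0, min_support=2):
    """`occurrences` is an iterable of (species, lat, lon, year) records.
    Returns species pairs that share a space-time cell at least
    `min_support` times."""
    cells = defaultdict(set)          # (lat cell, lon cell, year) -> species
    for species, lat, lon, year in occurrences:
        cell = (int(lat // cell_deg), int(lon // cell_deg), year)
        cells[cell].add(species)

    pair_counts = defaultdict(int)
    for species_in_cell in cells.values():
        for pair in combinations(sorted(species_in_cell), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

records = [("Apis mellifera", 50.9, 11.6, 2020),
           ("Bombus terrestris", 50.7, 11.3, 2020),
           ("Apis mellifera", 51.2, 11.8, 2021),
           ("Bombus terrestris", 51.4, 11.9, 2021)]
print(cooccurrence_patterns(records, min_support=2))
# {('Apis mellifera', 'Bombus terrestris'): 2}
```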
Clemens-Alexander Brust, Björn Barz and Joachim Denzler: Carpe Diem: A Lifelong Learning Tool for Automated Wildlife Surveillance
Abstract:
We introduce Carpe Diem, an interactive tool for object detection tasks such as automated wildlife surveillance.
It reduces the annotation effort by automatically selecting informative images for annotation, facilitates the annotation process by proposing likely objects and labels, and accelerates the integration of new labels into the deep neural network model by avoiding re-training from scratch.
Carpe Diem implements active learning, which intelligently explores unlabeled data and only selects valuable examples to avoid redundant annotations. This strategy saves expensive human resources. Moreover, incremental learning enables a continually improving model. Whenever new annotations are available, the model can be updated efficiently and quickly, without re-training, and regardless of the amount of accumulated training data. Because there is no single large training step, the model can be used to make predictions at any time. We exploit this in our annotation process, where users only confirm or reject proposals instead of manually drawing bounding boxes.
Andreas Kohlbecker, Anton Güntsch, Norbert Kilian, Wolf-Henning Kusber, Katja Luther, Andreas Müller, Eckhard von Raab-Straube and Walter Berendsohn: A pragmatic approach to concept-based annotation of scientific names in biodiversity and environmental research data
Abstract:
With the increasing number of interdisciplinary and international networks dedicated to long-term persistence and interoperability of research data, the demand for semantic linking of environmental research data has grown. Data related to organisms frequently carry a major obstacle: organisms are often identified only ambiguously by their scientific name, which is not a precise identifier for the taxonomic concept implicitly being used. Here we describe a new strategy for managing concepts in taxonomic research projects in a way that assures stability of concepts by allowing concepts to be transformed into new ones at well-defined transition points. These taxonomic concepts can be assigned persistent identifiers suitable for semantically correct annotation of environmental data sets.
Samira Babalou, David Schellenberger Costa, Jens Kattge, Christine Römermann and Birgitta König-Ries: Towards a Semantic Toolbox for Reproducible Knowledge Graph Generation in the Biodiversity Domain – How to Make the Most out of Biodiversity Data
Abstract:
Knowledge Graphs are widely regarded as one of the most promising ways to manage and link information in the age of Big Data. Their broad uptake is still hindered by the large effort required to create and maintain them, though. In this paper, we propose the creation of a semantic toolbox that will support data owners in transforming their databases into reproducible, dynamically extendable knowledge graphs that can be integrated and jointly used. We showcase the need, potential benefits and first steps towards the solution in our example domain, biodiversity research.
Frank Broda, Frank Lange, Fabian Mauz and Ludger Wessjohann: The Cloud Resource & Information Management System (CRIMSy): Supporting collaboration and data management in the life sciences
Abstract:
Information integration among collaborating research groups is still challenging in many scientific disciplines. Repositories are invoked in late stages of the research data lifecycle, typically for publication and archiving. However, in early stages of research, there is a need for sharing unpublished documents and data among research groups within trusted digital environments, which ideally should support discipline-specific data types and processes. Existing solutions often lack such support and limit the data autonomy of their users. The Cloud Resource & Information Management System (CRIMSy) tries to close this gap by providing a distributed data infrastructure for the life sciences. Within this infrastructure, each participating institution maintains its own node with documents and data provided by its researchers. Federated search and document retrieval across the cloud has been designed with semantic integration and chemical understanding in mind and will soon also allow searches in biological sequence entities using established bioinformatics tools. We recently extended the software with a simple electronic lab notebook (ELN) and a chemistry-aware storage and inventory system to aid documentation of both laboratory and field work. Our federation concept may be scaled from small and informal research projects to large multi-national research consortia. Fine-grained permissions for user groups as well as individual users control access privileges. Thus, each institution retains full control over its data. A single node can be a member of multiple cloud instances. Within each cloud instance, trust is established via mutual certificate-based authentication. At the same time, CRIMSy is easy to administrate and requires only minor computational resources. We are developing CRIMSy as open-source software and also offer independently usable software modules such as database extensions and UI components.
Project Website: https://github.com/ipb-halle/CRIMSy
Christian Beilschmidt, Johannes Drönner, Michael Mattig and Bernhard Seeger: Geo Engine: A flexible Dashboard for Exploring and Analyzing Biodiversity Data
Abstract:
Geo Engine is an online research environment for spatio-temporal data processing. It consists of four basic building blocks: data provision, processing, interfaces, and visualization. It stores and accesses data from various sources. The data processing automatically performs transformations and alignment of geographical and temporal information. Users interact with Geo Engine either via web-based graphical dashboards, via standardized APIs (OGC), or via libraries for programming languages such as Python.
Our platform is especially suited for biodiversity researchers. It is interoperable and allows fast and intuitive data exploration. The native time support allows for time-series analysis, which is crucial when investigating trends and developments in biodiversity. Analyses are accessible as reproducible workflows and are thus reusable and comprehensible. The workflows, in combination with our modular software stack, allow the seamless construction of specialized data portals for interested groups or decision-makers.
In this presentation, we give an overview of Geo Engine and its dashboard capabilities in the biodiversity domain. Concretely, we show an example with observation data that is harvested from collections and data centers of GFBio partners. We map this data and combine it with external remote sensing data from ESA’s Sentinel 2 satellite. In addition, we introduce Geo Engine’s capabilities to define workflows for data processing in an ad-hoc manner and access their results via Geo Engine’s Python library. In summary, Geo Engine is a powerful platform for realizing flexible dashboards and providing rich data access based on the FAIR principles.
Luise Quoß, Henrique Pereira, Néstor Fernandez, José Valdez and Christian Langer: ebvnetcdf R package
Abstract:
Multidimensional geospatial data is increasingly used in biodiversity research. This data can cover spatiotemporal estimates of biodiversity metrics using models and projection scenarios, as well as biodiversity products derived from remote sensing. However, the disparity of formats and criteria used to arrange the data severely limits their interoperability. Essential Biodiversity Variables (EBV) datasets are defined as measurements providing essential information for analysing, tracking and reporting the state of biodiversity. A data and metadata standard has recently been developed to consistently organize and document the EBV cubes. The EBV data cubes are defined along the three dimensions of space, time and biological entities (e.g. species or types of ecosystems). These cubes are organized in hierarchical groups to allow for multiple biodiversity metrics and scenario projections per cube. However, tools that facilitate the production of these EBV cubes have been missing. In this demo, we present the ebvnetcdf R package, a tool tailored to produce EBV cubes using a specification of the netCDF format and metadata compliant with the ACDD and CF conventions. The package functionality covers access to existing EBV netCDF datasets as well as the creation of new ones. The user can retrieve the metadata, which is distributed across the different netCDF components. Different visualization functions are available for fast data exploration, covering the temporal and spatial scope of the data. Specific functions have been implemented to access data subsets and to perform spatial resampling of the data in order to spatially align multiple EBV cubes. The creation of the EBV datasets can be accomplished in interaction with the EBV Portal and the R package. Finally, the created EBV datasets can be uploaded and shared through the EBV Portal. Together with the EBV Portal, ebvnetcdf facilitates the exchange of interoperable scientific biodiversity data.
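To make the cube layout concrete, here is a minimal sketch of an EBV-style netCDF cube written with the netCDF4 Python package (Python rather than R, for brevity). The dimension, group, and attribute names are assumptions illustrating the space-time-entity structure and hierarchical groups described above, not the actual schema used by ebvnetcdf.

```python
import numpy as np
from netCDF4 import Dataset

# Write a toy cube: 3 biological entities x 12 time steps on a 1-degree grid.
with Dataset("ebv_cube_demo.nc", "w") as ds:
    ds.title = "Demo EBV-style cube"          # ACDD-style global attribute
    ds.createDimension("lon", 360)
    ds.createDimension("lat", 180)
    ds.createDimension("time", 12)
    ds.createDimension("entity", 3)           # e.g. three species

    grp = ds.createGroup("metric_1")          # one group per metric/scenario
    cube = grp.createVariable("ebv_cube", "f4",
                              ("entity", "time", "lat", "lon"), zlib=True)
    cube.units = "unitless"                   # CF-style variable attribute
    cube[:] = np.random.rand(3, 12, 180, 360).astype("f4")
```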
Christian Langer, Néstor Fernández, Jose Valdez, Luise Quoss and Henrique Pereira: Cataloging Essential Biodiversity Variables with the EBV Data Portal
Abstract:
Essential Biodiversity Variables (EBVs) are used to monitor the status and trends in biodiversity at multiple spatiotemporal scales. They provide an abstraction level between raw biodiversity observations (spatially explicit data, e.g. GeoTIFFs) and indicators (graphs/diagrams showing changes of a metric over a period of time, e.g. the total amount of natural habitat over time), enabling better access to policy-relevant biodiversity information. Furthermore, the EBV vision aims to support the detection of critical change, among other things, with easy-to-use tools and dashboards accessible to a variety of users and stakeholders.
We present the EBV Data Portal, a platform for distributing and visualising EBV datasets. It contains a geographic cataloging system that supports a large number of spatiotemporal description features and enables their discoverability. To facilitate user interaction, it offers a web-based interface where users can (1) share and/or (2) find essential biodiversity spatiotemporal data through intuitive interaction with cataloging and visualisation tools. Using the EBV Catalog module, the user can explore the characteristics of the data based on the definition of an EBV Minimum Information metadata standard. The Catalog also allows browsing the metadata description in both the ACDD standard (JSON) and the EML standard (XML). This enables easy interoperability with other metadata catalogs.
An example application is the calculation of EBV summary statistics for selected countries and areas. Using the EBV Data Portal, users can select EBVs, calculate basic biodiversity change metrics from spatiotemporal subsets, and conveniently visualise complex, multidimensional biodiversity datasets. These visualisation and analysis tools of the EBV Data Portal are a first step towards an EBV-based dashboard for biodiversity analyses.
Alsayed Algergawy, Hamdi Hamed and Birgitta König-Ries: JeDaSS: A Tool for Dataset Summarization and Synthesis
Abstract:
The Collaborative Research Center (CRC) AquaDiva is a large collaborative project spanning a variety of domains including biology, geology, chemistry, and computer science with the common goal to better understand the Earth’s critical zone. Datasets collected within AquaDiva, like those of many other cross-institutional, cross-domain research projects, are complex and difficult to reuse since they are highly diverse and heterogeneous. This limits dataset accessibility to the few people who were either involved in creating the datasets or have spent a significant amount of time aiming to understand them. This is even true for scientists working in other parts of the same project. They, too, will need time to figure out the major theme of unfamiliar datasets. We believe that dataset analysis and summarization is an elegant way to provide a concise overview of an entire dataset. This makes it possible to explore in depth only those datasets of potential interest.
We have developed JeDaSS, the Jena Dataset Summarization and Synthesis tool, which semantically classifies data attributes of tabular scientific datasets based on a combination of semantic web techniques and deep learning. This classification contributes to summarizing individual datasets, but also to linking them to others. We believe that figuring out the subject of a dataset is an important and basic step in data summarization. To this end, the proposed approach categorizes a given dataset into a domain topic. With this topic, we then extract hidden links between different datasets in the repository. The proposed approach has two main phases: (1) an offline phase to train and build a classification model using a supervised deep learning approach, and (2) an online phase making use of the pre-trained model to classify datasets into the learned categories. To demonstrate the applicability of the approach, we analyzed datasets from the AquaDiva Data Portal.
Felicitas Löffler, Fateme Shafiei, Sven Thiel, Kobkaew Opasjumruskit and Birgitta König-Ries: [Dai:Si] – Modular dataset retrieval for biological data
Abstract:
Dataset search is a challenging and time-consuming task. In several studies, scholars report on difficulties they have in finding relevant datasets. Obstacles are, for instance, inadequate search tools or insufficient metadata descriptions with arbitrary keywords. In particular, data descriptions in biodiversity research are heterogeneous and fuzzy, hampering dataset retrieval based on a syntactic match of query terms and keywords in the dataset. In addition, scientific user requirements are very broad and can range from specific requirements that can be expressed in concrete query terms to more exploratory tasks requiring subject-based filtering.
In this demonstration, we present [Dai:Si], a modular framework for dataset search with semantic search capabilities for biological data. It is based on a former search service developed in the scope of the GFBio project. This new version adheres to the recommendations of the RDA Data Discovery Interest Group (RDA DDIG) for data repositories to overcome the current problems in dataset search. Therefore, [Dai:Si] provides different query interfaces, including a semantic search expanding query terms with related keywords obtained from GFBio’s Terminology Service. In addition, the system offers a faceted search, highlights important information for data reuse such as data access and data license at a glance, displays geo-coordinates on a map, and allows data collection and data download. Modularity is one of the main aims of the framework. Therefore, domain- and business-specific logic are separated from functional components. This allows an easy setup of [Dai:Si] for different search indexes.
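The query expansion step can be sketched in a few lines. In the snippet below, the `related` lookup table stands in for a call to GFBio's Terminology Service; the terms and data are hypothetical.

```python
def expand_query(terms, related_terms):
    """Semantic expansion: enrich the user's query terms with related
    keywords (e.g. synonyms or broader terms from a terminology service)."""
    expanded = set(terms)
    for term in terms:
        expanded.update(related_terms.get(term, []))
    return expanded

# Hypothetical lookup result for the query term "bee":
related = {"bee": ["Apis", "Apoidea", "pollinator"]}
print(expand_query(["bee"], related))
# e.g. {'bee', 'Apis', 'Apoidea', 'pollinator'} (set order varies)
```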
At the workshop, we will give a live demonstration of [Dai:Si] with GFBio’s search index. We will also present a second use case to demonstrate its usage for different dataset search indexes. The code is publicly available in our GitHub repository: https://github.com/fusion-jena/DaiSi.
Björn Quast, Christian Bräunig and Peter Grobe: Bridging the GAP in Metabarcoding Research: A shared repository for ASV tables
Abstract:
Metabarcoding is a tool to routinely identify species in environmental mass samples and analyze the species composition of such samples. Using metabarcoding techniques in ecological, environmental, and taxonomic research as well as in monitoring projects outperforms traditional species identification by human experts in amount, velocity, and quality when well-curated reference data are available. Therefore, metabarcoding can be seen as the future standard method for all biological research areas where species occurrence, distribution, and species composition are in question.
A common outcome of metabarcoding research are so-called ASV (Amplicon Sequence Variant) tables. These tables combine the extracted sequences of all sampling plots with the number of occurrences of each sequence within the different plots. To identify the species for each sequence, a taxonomic assignment is done by performing a BLAST search against one or more reference databases such as the Barcode Of Life Data system (BOLD, boldsystems.org) or the German Barcode of Life library (bolgermany.de). The assigned tables provide a varying resolution of the detected species, depending on the quality and coverage of the reference libraries used. (A minimal sketch of this table shape is shown below.)
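The sketch uses invented data: rows are amplicon sequence variants, columns are sampling plots, cells are read counts, and each sequence carries a (revisable) taxonomic assignment.

```python
# Invented miniature ASV table; real tables come from metabarcoding pipelines.
asv_table = {
    "ASV_001": {"plots": {"plot_A": 152, "plot_B": 0, "plot_C": 34},
                "taxon": "Baetis rhodani"},   # assigned via a reference DB
    "ASV_002": {"plots": {"plot_A": 7, "plot_B": 89, "plot_C": 3},
                "taxon": "Gammarus pulex"},
}

def plots_with_species(table, species):
    """Plots in which a given species was detected (read count > 0)."""
    return [plot
            for asv in table.values() if asv["taxon"] == species
            for plot, count in asv["plots"].items() if count > 0]

print(plots_with_species(asv_table, "Baetis rhodani"))  # ['plot_A', 'plot_C']
```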
Thus, the assignment of sequences in the ASV table is subject to changes in taxonomic annotation over time, which makes it difficult to make reliable statements about the occurrence of a species at a particular locality. At the same time, there is no central registry for sequences from metabarcoding, which makes analysis across multiple taxa and sample locations difficult and prone to error, as sequences in different ASV tables from different studies might be assigned to different taxa. With growing barcode reference libraries, quality enhancement in species assignment, and the introduction of new marker sequences, the species matching in ASV tables improves over time. Currently, a publicly available database system for the management of ASV tables is lacking; in particular, there is no solution that keeps pace with the enhancements and refinements in barcode reference libraries. To fill this gap, we are developing an ASV table registry as an online application and database that allows users to:
– register ASV sequences, either by mass upload or on the fly when submitting ASV tables
– upload and manage ASV tables
– publish ASV tables and make them citable with DOIs
– search for ASV sequences, ASV tables containing given sequences, and occurrence data
– identify sequences in ASV tables against different barcode reference databases
– keep track of the applied identification methods and parameters and the outcome of each identification
Such a dynamic registry for ASV tables ensures that published data with species assignments remain accessible over time and can be reused in further studies. Currently, ASV tables are stored as supplements to publications or in private repositories, making it difficult to find them. Reusing such ASV tables in other research projects is difficult when the underlying ASV sequences are not available. Furthermore, keeping track of the different species assignments in such tables is not possible when they are not accessible in a single registry.
The ASV table registry developed here aims to make the ASV tables FAIR in the sense of Findable, Accessible, Interoperable, and Reusable and to foster the shared use of ASV tables in different research projects.
The ASV table registry is developed within the BMBF-funded project GBOL III – Dark Taxa and will be presented at the Software Marketplace in a live demo.
Jitendra Gaikwad, Roman Gerlach, David Schöne, Sven Thiel, Franziska Zander and Birgitta König-Ries: BEXIS2: A data management platform for mid-to-large scale research projects to facilitate making biodiversity research data Findable, Accessible, Interoperable and Reusable
Abstract:
For biodiversity researchers, acquiring data managed according to the FAIR data principles (Findable, Accessible, Interoperable, Reusable) is crucial for conducting meaningful research, potentially leading to better understanding and conservation efforts. However, biodiversity researchers, especially in mid-to-large scale research projects, are overwhelmed by the challenge of truly implementing the FAIR data principles due to inadequate research data management skills and a lack of supporting infrastructure and tools.
To alleviate this pressure point and achieve the objective of making biodiversity data reusable, BEXIS2, an open-source research data management system, has been developed (http://bexis2.uni-jena.de). The system is modular and extensible, containing features and functions that support different data lifecycle stages such as data collection, discovery, quality assurance, and integration. BEXIS2 is developed considering the needs and data management challenges of mid-to-large scale research projects with multiple sub-projects involving up to several hundred researchers.
BEXIS2 is developed in C#, utilizes ASP.NET (version 4.5.2) MVC, is deployed on MS IIS, and uses PostgreSQL as the default relational database management system (https://github.com/BEXIS2). The main features of the system are (1) support for multiple metadata and data standards, (2) creation of reusable data structures, variables and measurement units, (3) dataset version management, (4) a facility to implement data sharing and access policies at researcher, project and cross-organization level, (5) data handling via the web user interface and Application Programming Interfaces (APIs), (6) interoperability with external data platforms such as GFBio and DataCite, (7) single sign-on, and (8) an open-source modular system facilitating enhancement and customization as per project needs. The demo version of BEXIS2 is available at https://demo.bexis2.uni-jena.de.
Steffen Ehrmann: geometr – Generate and modify interoperable geometric shapes
Abstract:
Many biodiversity researchers make use of R as their preferred tool for data management. This tool’s popularity has eventually led to a large diversity of packages for all sorts of typical tasks. One such task of a biodiversity researcher is analysing data along thematic (species, biomes, etc.) and spatial gradients. Recently, R has seen many changes to how spatial data are managed and analysed, for instance, with the introduction of the sf, terra or stars packages.
With a plethora of less popular or legacy packages, a relatively large number of data standards and tool-chains are now available to work with spatial data in R. This comes with a lack of interoperability when packages have primarily been designed based on their own merit and not with regard to functioning well with other tools. Therefore, truly reproducible tool-chains (also when running them in the future) often cannot be realised, because either input or output standards of the packages used at that time change unpredictably.
Spatial classes are typically a collection of geometric shapes (or their vertices) accompanied by various metadata (such as attributes and a coordinate reference system). They are thus conceptually quite similar. The R package geometr makes use of this and provides “getters” and “setters” as interoperable tools for working with spatial objects. Getters produce output for functionally identical information (such as feature attributes or spatial dimensions) of different spatial classes in the same (interoperable) quality. Likewise, setters use an identical input to write to various classes that require different input. Both getters and setters play well with the popular standards of the tidyverse. The primary purpose of geometr is to simplify access to spatial information in a fashion familiar to data managers and make scripts and functions of a code designer more interoperable. Building a tool-chain based on geometr will also still result in reproducible scripts/functions when input formats of other packages have changed in the future, or new standards have appeared.
I would love to present the package as a live demo. https://ehrmanns.github.io/geometr/
Till-Hendrik Macher, Arne J. Beermann and Florian Leese: TaxonTableTools – A comprehensive, platform-independent graphical user interface software to explore and visualise DNA metabarcoding data
Abstract:
DNA-based identification methods, such as DNA metabarcoding, are increasingly used as biodiversity assessment tools in research and environmental management. Although powerful analysis software exists to process raw data, the translation of sequence read data into biological information and downstream analyses may be difficult for end users with limited expertise in bioinformatics. Thus, the need for easy-to-use, graphical user interface (GUI) software to analyze and visualize DNA metabarcoding data is growing. Here we present TaxonTableTools (TTT), a new platform-independent GUI that aims to fill this gap by providing simple, reproducible analysis and visualization workflows. The input format of TTT is a so-called “TaXon table”. This data format can easily be generated within TTT from two common file formats that can be obtained using various published DNA metabarcoding pipelines: a read table and a taxonomy table. TTT offers a wide range of processing, filtering and analysis modules. The user can analyze and visualize basic statistics, such as read proportion per taxon, as well as more sophisticated visualizations such as interactive Krona charts for taxonomic data exploration, or complex parallel category diagrams to assess species distribution patterns. Venn diagrams can be calculated to compare taxon overlap among replicates, samples, or analysis methods. Various ecological analyses can be produced directly, including alpha or beta diversity estimates, rarefaction analyses, and principal coordinate or non-metric multidimensional scaling plots. The taxonomy of a data set can be validated via the Global Biodiversity Information Facility (GBIF) API to check for synonyms and spelling mistakes. Furthermore, geographical distribution data can be automatically downloaded from GBIF. Additionally, TTT offers a conversion tool for DNA metabarcoding data into formats required for traditional, taxonomy-based analyses performed by regulatory bioassessment programs. Beyond that, TTT is able to produce fully interactive html-based graphics that can be analyzed in any web browser. The software comes with a manual and tutorial, is free and publicly available through GitHub (https://github.com/TillMacher/TaxonTableTools) or the Python package index (https://pypi.org/project/taxontabletools/).
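The construction of the "TaXon table" mentioned above amounts to joining a read table and a taxonomy table on the sequence identifier. Below is a minimal sketch with pandas and invented miniature data; it illustrates the join, not TTT's actual code.

```python
import pandas as pd

# Invented miniature inputs; real tables come from DNA metabarcoding pipelines.
reads = pd.DataFrame({"ID": ["ASV_1", "ASV_2"],
                      "sample_A": [120, 0],
                      "sample_B": [3, 87]})
taxonomy = pd.DataFrame({"ID": ["ASV_1", "ASV_2"],
                         "Genus": ["Baetis", "Gammarus"],
                         "Species": ["Baetis rhodani", "Gammarus pulex"]})

# Join taxonomy and per-sample read counts on the sequence identifier.
taxon_table = taxonomy.merge(reads, on="ID")
print(taxon_table)
```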
Ariane Grunz, Dagmar Triebel, Stefan Seifert, Anton Link and Gerhard Rambold : Live Demo of DiversityNaviKey, a Progressive Web Application for interactive diagnosis and identification
Abstract:
DiversityNaviKey (DNK, https://diversityworkbench.net/Portal/DiversityNaviKey) is a Progressive Web Application (PWA) designed as an identification tool in the field of biology and related sciences. The interactive selection procedure is based on a set of descriptors predefined by the data provider. The descriptors are structured and organised as categorical, numeric and free-text. Using DNK, existing knowledge as delivered by bio- and geodiversity experts, taxonomists and ecologists is made available for diagnosis. The target groups of data providers are professionals who intend to publish their structured data describing identifiable biological or environmental entities, reusable under the open data paradigm. The spectrum of users depends on the data source.
DNK runs in any common modern browser, providing an app-like experience to the user. It is device-independent and uses caching mechanisms of browsers such as Service Worker and IndexedDB to support an offline mode. The content data, or trait data, is accessed via a web service as JSON packages. A current setup is installed at the SNSB IT Center. It uses SNSB data pipelines with a cache database and the data management tool DiversityDescriptions to ‘flatten’ the data for the web application, transferred via the web service.
The live demo will give a short introduction to the DiversityNaviKey web app and the implementation and architecture of the service. It will demonstrate the current setup with exemplary data sources, available at https://divnavikey.snsb.info. Participants in the live demo will be guided in using the web app on their smartphones. The presentation will show the various types and major functions of queries, the setting options, and how to switch between data sources. Various datasets exist to give a sufficient impression of what kinds of data sources may be appropriate for DiversityNaviKey.
Committees
Program Committee
Sven Bingert, GWDG
Jan Bumberger, UFZ
Peter Grobe, ZFMK Bonn
Naouel Karam, Fraunhofer FOKUS
Jens Kattge, Max Planck Institute for Biogeochemistry
Michael Kirstein, GDA Bayern
Friederike Klan, DLR Institut für Datenwissenschaften
Ivaylo Kostadinov, GFBio e.V.
Patrik Mäder, TU Ilmenau
Volker Mosbrugger, Senckenberg Gesellschaft für Naturforschung
Thomas Nauss, Universität Marburg
Jörg Overmann, Leibniz-Institut DSMZ
Henrique Pereira, German Centre for Integrative Biodiversity Research (iDiv)
Uwe Scholz, IPK
Maha Shadaydeh, Friedrich-Schiller-Universität Jena
Philipp Wieder, GWDG
Organizers
Anton Güntsch, BGBM and FU Berlin
Birgitta König-Ries, Friedrich-Schiller-Universität Jena
Markus Schmalzl, GDA Bayern
Bernhard Seeger, Universität Marburg
Dagmar Triebel, Staatliche Naturwissenschaftliche Sammlungen Bayerns SNSB