Facilitating species identification in environmental samples

In three projects with different organisms, we are exploring how standardized analysis pipelines can facilitate DNA-based species identification and its subsequent publication in specialized databases.

Our metabarcoding Use Cases

DNA metabarcoding is still a relatively new method used in molecular biology and biodiversity informatics to identify many species simultaneously. Workflows and pipelines execute the necessary algorithms and make the data transformations traceable. Based on the requirements of three use case projects in which metabarcoding data on different groups of organisms are analyzed (DNAquaNet, AlgaTerra and GBOL), an easy-to-use pipeline for the analysis of metabarcoding data for biodiversity research was developed as part of one NFDI4Biodiversity use case. In addition to facilitating the long-term storage of raw data, the system integrates functionality to foster the publication of the analyzed data in relevant specialized databases.

About DNAquaNet

DNAquaNet is a European network researching DNA-based monitoring of freshwater ecosystems. By using environmental DNA (eDNA), living organisms in aquatic environments can be identified quickly and comprehensively. DNAquaNet, in particular the University of Duisburg-Essen, has an extensive collection of DNA and eDNA datasets that provide valuable insights into biotic communities and support monitoring under the Water Framework Directive. The current challenge is the lack of standards for data and analysis methods. Through NFDI4Biodiversity, DNAquaNet is working to make standardized solutions accessible to a broader user community.

About AlgaTerra

The AlgaTerra Information System combines research data on microalgae with molecular sequences. As part of NFDI4Biodiversity, this knowledge is made accessible and usable for research through cloud-based metabarcoding tools. AlgaTerra offers curated data, particularly on diatoms, and facilitates the presentation of molecular, ecological and taxonomic information. These data support both scientific and conservation efforts and are complemented by microscopic images.

About GBOL

The German Barcode of Life (GBOL) project captures marker genes for the identification of organisms and stores them in global reference libraries. In the third project phase, “dark taxa” that are difficult to identify will also be included. The goal is to connect the GBOL data to the cloud-based infrastructure of NFDI4Biodiversity and to develop an interface for taxonomy checklists. Additionally, features for species identification and the integration of citizen science projects will be expanded.

The goal: Making valuable sequence data available in the long term

DNA metabarcoding offers great potential for rapidly and efficiently capturing species diversity across different ecosystems. By simultaneously sequencing small, standardized gene fragments (DNA barcodes), thousands of individuals can be identified, often down to species level. The DNA sequences obtained are sorted using bioinformatic algorithms and assigned to the corresponding species by comparing them with a reference database. This method enables the monitoring of almost the entire species diversity of a habitat and provides semi-quantitative information on the frequency of species in the samples.
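The comparison against a reference database described above can be illustrated with a minimal Python sketch. It is a toy example under stated assumptions, not the method of any particular pipeline: the reference sequences, the species names, the function names and the 90% identity threshold are all hypothetical, and real tools align sequences of differing lengths rather than comparing them position by position.

```python
from typing import Optional

# Hypothetical toy reference library: barcode sequence -> species name.
REFERENCE = {
    "ACGTACGTAC": "Species A",
    "ACGTTCGTAA": "Species B",
}


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def assign_taxon(
    query: str,
    reference: dict[str, str],
    min_identity: float = 0.9,  # assumed threshold, for illustration only
) -> Optional[str]:
    """Return the best-matching reference name, or None below the threshold."""
    best_name, best_score = None, 0.0
    for ref_seq, name in reference.items():
        score = identity(query, ref_seq)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_identity else None


if __name__ == "__main__":
    for query in ("ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"):
        print(query, "->", assign_taxon(query, REFERENCE))
```

A sequence that matches no reference closely enough stays unassigned, which is why the completeness of the reference database limits how many species can be named.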

Based on this, our goal within NFDI4Biodiversity is to provide a user-friendly pipeline for determining amplicon sequence variants (ASVs) from raw sequencing data and assessing the frequency of these unique sequences in the samples. In the next step, these sequences are assigned taxonomic identities using a reference database, enabling precise and comparable species identification.
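The idea of counting unique sequences per sample can be sketched in a few lines of Python. This is a simplified illustration and not how the pipeline itself is implemented: it only collapses identical reads and counts them, whereas real ASV inference also involves steps such as quality filtering and denoising. The sample names, reads and function names are hypothetical.

```python
from collections import Counter


def build_sequence_table(reads_by_sample):
    """Collapse identical reads into unique sequences with per-sample counts.

    Returns a nested dict: sequence -> {sample name -> read count}.
    """
    table = {}
    for sample, reads in reads_by_sample.items():
        for sequence, count in Counter(reads).items():
            table.setdefault(sequence, {})[sample] = count
    return table


def relative_frequency(table, sample):
    """Share of a sample's reads that belongs to each unique sequence."""
    total = sum(counts.get(sample, 0) for counts in table.values())
    if total == 0:
        return {}
    return {
        sequence: counts[sample] / total
        for sequence, counts in table.items()
        if counts.get(sample, 0) > 0
    }


if __name__ == "__main__":
    # Hypothetical reads from two samples.
    reads = {
        "sample_1": ["ACGTAC", "ACGTAC", "TTGCAA", "ACGTAC"],
        "sample_2": ["TTGCAA", "TTGCAA"],
    }
    table = build_sequence_table(reads)
    print(table)
    print(relative_frequency(table, "sample_1"))
```

The resulting table (unique sequences by samples) is the kind of structure to which taxonomic identities are attached in the next step.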

To make the sequence data available in the long term, established archives such as the National Center for Biotechnology Information (NCBI) and the European Nucleotide Archive (ENA) can be used; both are based on the MIxS standard and are simple to use. Templates are provided to simplify data transfer to these archives. Additionally, NFDI4Biodiversity offers a Data Submission Service that facilitates the submission of sequence data to the Global Biodiversity Information Facility (GBIF) for long-term availability and provides support in applying standards.

Achievements so far: data mobilization and community involvement

The experts in the three use case projects rely on the APSCALE pipeline (apscale on GitHub) to analyze the data. In cooperation with the NFDI consortium NFDI4Microbiota, a user-friendly workflow was developed for the cloud-based workflow manager CloWM. This workflow includes an input interface for raw data and parameters and handles the entire execution and data management of the APSCALE pipeline. Initial tests have been successfully conducted on a staging instance (clowm-staging.bi.denbi.de), with plans to publish on the production instance clowm.de. The sequence data obtained will be compared to taxon-specific reference databases to generate species lists from the environmental samples.

The computing resources provided by de.NBI enable the standardized analysis of large datasets with minimal time investment, regardless of the users' individual technical setups, and allow for direct transfer of data into long-term archives. Long-term support and regular updates to the pipeline are assured. With an analysis pipeline now suitable for various species groups, best practice recommendations are being developed for the communities involved in each use case.

Promising collaborations are already underway between the DNA metabarcoding community and the GBIF network to publish analyzed species occurrence data. A key project in this effort is the development of the GBIF Metabarcoding Toolkit (GBIF MBT), which allows researchers to mobilize and access their data directly within GBIF and to identify synonymies using the Catalogue of Life (COL). The output formats of the NFDI4Biodiversity metabarcoding pipeline are aligned with the requirements of the GBIF MBT, which is currently in development. This ensures seamless handling of the generated species occurrence data and facilitates integration into platforms such as GBIF and the Living Atlas of Nature Germany (LAND), another NFDI4Biodiversity use case.
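To give a rough idea of what "species occurrence data" derived from a metabarcoding run might look like, the following Python sketch flattens a table of read counts into one record per sequence variant and sample. It is an illustration only and does not reproduce the input specification of the GBIF MBT or the actual output of the pipeline. The column names are borrowed from Darwin Core terms as an assumption; the ASV identifiers, sample names and taxon names are hypothetical.

```python
import csv
import io

# Column names borrowed from Darwin Core terms (an assumption for this sketch).
FIELDS = [
    "occurrenceID",
    "eventID",
    "scientificName",
    "organismQuantity",
    "organismQuantityType",
]


def to_occurrence_rows(asv_table, taxonomy):
    """Turn {asv_id: {sample: read_count}} into one record per detection."""
    rows = []
    for asv_id, counts in asv_table.items():
        for sample, read_count in counts.items():
            if read_count == 0:
                continue  # a variant absent from a sample is not an occurrence
            rows.append(
                {
                    "occurrenceID": f"{sample}:{asv_id}",
                    "eventID": sample,
                    "scientificName": taxonomy.get(asv_id, ""),
                    "organismQuantity": read_count,
                    "organismQuantityType": "DNA sequence reads",
                }
            )
    return rows


def rows_to_csv(rows):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical read counts and taxonomic assignments.
    asv_table = {"ASV_1": {"sample_1": 120, "sample_2": 0}, "ASV_2": {"sample_2": 45}}
    taxonomy = {"ASV_1": "Species A", "ASV_2": "Species B"}
    print(rows_to_csv(to_occurrence_rows(asv_table, taxonomy)))
```

Keeping one record per detection, with the read count as a quantity, is one way to make results comparable across samples; the formats actually required are defined by the respective platforms.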

Contact

Would you like to find out more about this use case? You are welcome to contact the people involved.

Use Case Manager (NFDI4Biodiversity)

Christoph Schomburg (c.schomburg@uni-kassel.de)

Technical expertise (NFDI4Biodiversity)

Ivaylo Kostadinov (ikostadi@gfbio.org)