Abstract
Enzyme mining, as a central part of biotechnological advancement, relies on robust data management practices. At SINTEF, we recognize that producing data on a project-by-project basis demands adherence to the FAIR principles of Findable, Accessible, Interoperable, and Reusable data.
By standardizing the output and increasing the ease of use of our bioinformatics tools, we can enhance collaboration, accelerate discoveries, and promote transparency. Our recently developed Nextflow-based pipeline exemplifies this commitment, enabling efficient and standardized enzyme mining across diverse projects.
To standardize enzyme mining and make it more efficient, we have implemented a Nextflow pipeline that performs HMM-profile and BLAST searches, sequence similarity networking and clustering, structure prediction, and structure-based searches. This talk covers the general methods implemented and the technologies that make such a pipeline possible.
Nextflow was chosen as the framework for the pipeline for the following reasons: 1. A large community and many predefined workflows (nf-core) that can be used as inspiration or directly in our pipeline. 2. Support for containers (such as Docker and Apptainer/Singularity), ensuring standardized execution across systems. 3. Simplified scaling to high-performance computing and cloud computing environments. 4. Integration with Git for version control and distribution.
The main dataflow in the pipeline starts with searching local (meta)genomic databases for hits to one or more HMM profiles for a specific enzyme family or class. The hits are then clustered using a sequence similarity network, and a single representative sequence is selected per cluster. The 3D structures of these representative protein sequences are predicted using AlphaFold, and Foldseek is used to search public databases for structural homologs of the hits. From the resulting shortlists, novel and/or interesting candidates are manually evaluated and selected for downstream experimental characterization.
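The clustering step described above can be illustrated with a minimal sketch. It assumes that pairwise sequence identities between hits are already available, that clusters are the connected components of the network at an identity threshold, and that the representative is the most connected member of each cluster. The function names, the 0.4 threshold, and the selection rule are hypothetical choices made for illustration and are not taken from the pipeline itself.

```python
from collections import defaultdict


def cluster_hits(hit_ids, pairwise_identity, threshold=0.4):
    """Group hits into connected components of a sequence similarity network.

    pairwise_identity maps (hit_a, hit_b) tuples to a fractional identity.
    The threshold value is an illustrative assumption.
    """
    parent = {h: h for h in hit_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = defaultdict(int)
    for (a, b), identity in pairwise_identity.items():
        if identity >= threshold:
            # An edge in the network: count it and merge the two components.
            degree[a] += 1
            degree[b] += 1
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for h in hit_ids:
        clusters[find(h)].append(h)
    return list(clusters.values()), degree


def pick_representatives(clusters, degree):
    """Pick the most connected member per cluster (assumed rule).

    Ties are broken alphabetically so the result is deterministic.
    """
    return [min(members, key=lambda h: (-degree[h], h)) for members in clusters]


# Toy input: five hypothetical hits and a few pairwise identities.
hits = ["A", "B", "C", "D", "E"]
identities = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "C"): 0.3,
    ("D", "E"): 0.2,
}
clusters, degree = cluster_hits(hits, identities)
representatives = sorted(pick_representatives(clusters, degree))
print(representatives)
```

In this toy input, A, B, and C form one cluster linked through B, while D and E remain singletons because their identity falls below the threshold. In practice the representative could equally be chosen by HMM score or sequence length; the sketch only shows the overall shape of the step.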
The established pipeline has led to more standardized and efficient execution of the bioinformatics tools and programs involved in our enzyme mining activities. It also greatly simplifies the use of command-line interface (CLI) tools for users unfamiliar with them by bundling everything into a unified graphical user interface. The pipeline has been used in publicly funded national and international projects as well as in projects privately funded by industry, including SFI Industrial Biotechnology, BLUETOOLS, ESTELLA, EnXylaScope, and AtlantECO. Going forward, we plan to extend our use of the Nextflow framework to our other analysis pipelines in industrial and medical biotechnology, bioprospecting, and environmental and ecological studies.