Module 2.3b Reference-based assembly 2023 (SARS-CoV-2/Influenza) (outline)

Theme 2: Sequencing and assembly


Note: This module was removed from the 2024 version of the course, since reference-based genome assembly is more commonly used with viruses than with bacteria. Removing this block freed up more time for assembly QC.


2.3b: Reference-based assembly (SARS-CoV-2/Influenza)

 

Time: 10-12

Activity description: Short practical intro. Practical: “Generate a SARS-CoV-2 genome step by step”

ILOs:

·       Understand the key steps involved in generating a reference-based assembly

·       Assemble a SARS-CoV-2 genome step by step


Time: 13-14.30

Activity description: Short practical intro. Practical: “Generate an influenza genome with IRMA”

ILOs:

·       Summarize key challenges related to sequencing influenza genomes

·       Use a published pipeline for generating a viral genome

·       Understand the result-files of the pipeline

·       Know how to modify parameters to adapt the analysis to user-specific needs


Time: 14.45-16.30

Activity description: Practical: “Coding session: write scripts to QC your genomes”

ILOs:

·       Be familiar with the most common stats used for evaluating reference-based assemblies

·       Be able to write custom python scripts to collect stats from fasta-files, fastq-files, bam-files, and plain-text files


Details

In the morning session practical, the participants will get hands-on experience in generating a SARS-CoV-2 genome from raw reads. At SSI, this is a fully automated process, with cron jobs continuously scanning for new data and executing an in-house developed pipeline. However, in the practical, the participants will perform each of the steps in this pipeline manually (on a single sample). Specifically, they will perform:

 

·       Quality trimming of raw data (Trim Galore)

·       Mapping of reads to the reference genome (bwa mem)

·       Sorting bam-files and removing unmapped reads (samtools)

·       Primer trimming on bam-files (iVar)

·       Consensus calling (bcftools)

 

By executing each step manually, the participants will get an understanding of the different steps involved.
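As a rough illustration of how these steps chain together, the sketch below builds one possible command sequence for a single paired-end sample. It is not the SSI in-house pipeline: the file names, the sample name, the thread count and the exact flags are assumptions chosen for illustration, and the script only prints the commands (a dry run) rather than executing them.

```python
import shlex


def build_commands(r1, r2, reference, primers_bed, sample, threads=4):
    """Return shell commands (as strings) for one paired-end sample, in order."""
    q = shlex.quote
    trim_dir = f"{sample}_trimmed"
    # Assumption: with --basename, Trim Galore names gzipped paired-end output
    # <sample>_val_1.fq.gz and <sample>_val_2.fq.gz inside the output folder.
    t1 = f"{trim_dir}/{sample}_val_1.fq.gz"
    t2 = f"{trim_dir}/{sample}_val_2.fq.gz"
    sorted_bam = f"{sample}.sorted.bam"
    trimmed_prefix = f"{sample}.primertrimmed"
    final_bam = f"{trimmed_prefix}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # 1. Quality trimming of raw reads
        f"trim_galore --paired --basename {q(sample)} -o {q(trim_dir)} {q(r1)} {q(r2)}",
        # 2. Index the reference (only needed once per reference)
        f"bwa index {q(reference)}",
        # 3. Map reads, drop unmapped reads (-F 4), and sort by coordinate
        f"bwa mem -t {int(threads)} {q(reference)} {q(t1)} {q(t2)}"
        f" | samtools view -b -F 4 -"
        f" | samtools sort -o {q(sorted_bam)} -",
        f"samtools index {q(sorted_bam)}",
        # 4. Soft-clip amplicon primers; iVar output must be re-sorted
        f"ivar trim -i {q(sorted_bam)} -b {q(primers_bed)} -p {q(trimmed_prefix)}",
        f"samtools sort -o {q(final_bam)} {q(trimmed_prefix + '.bam')}",
        f"samtools index {q(final_bam)}",
        # 5. Call variants (haploid) and apply them to the reference
        f"bcftools mpileup -Ou -f {q(reference)} {q(final_bam)}"
        f" | bcftools call -mv --ploidy 1 -Oz -o {q(vcf)}",
        f"bcftools index {q(vcf)}",
        f"bcftools consensus -f {q(reference)} -o {q(sample + '.consensus.fa')} {q(vcf)}",
    ]


if __name__ == "__main__":
    # Hypothetical input names; replace with real paths before running anything.
    for command in build_commands(
        "sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
        "reference.fasta", "primers.bed", "sample1",
    ):
        print(command)
```

Note that this sketch does not mask low-coverage positions with N, which a production pipeline would normally do before reporting a consensus.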

 

In the afternoon session practical, the participants will use a published, fully automated pipeline (IRMA), in which all the steps have been combined. With IRMA, genome assembly becomes a single command on the command line that specifies the input files and the output folder name. However, we will also discuss how parameters can be changed, depending on the analysis goals and data. Moreover, some time will be spent browsing the various output files, as IRMA produces a lot of information in addition to the genome sequences.
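To make the "single command" concrete, here is a minimal sketch that assembles an IRMA call for paired-end reads and summarises what ended up in the output folder. It assumes the usual invocation form `IRMA <module> <R1> <R2> <sample_name>`, where the last argument becomes the output folder name; the file and sample names are hypothetical, and the command is only printed, not run.

```python
import os


def irma_command(module, r1, r2, sample_name):
    """Build the argument list for a paired-end IRMA run (not executed here)."""
    return ["IRMA", module, r1, r2, sample_name]


def list_outputs(output_dir):
    """Group the files under output_dir by their top-level subfolder."""
    groups = {}
    for root, _dirs, files in os.walk(output_dir):
        rel = os.path.relpath(root, output_dir)
        top = "." if rel == "." else rel.split(os.sep)[0]
        groups.setdefault(top, []).extend(sorted(files))
    return groups


if __name__ == "__main__":
    # Hypothetical file and sample names.
    print(" ".join(irma_command("FLU", "flu_R1.fastq.gz", "flu_R2.fastq.gz", "flu_sample1")))
    # After a real run, the output folder could be browsed like this:
    if os.path.isdir("flu_sample1"):
        for folder, files in sorted(list_outputs("flu_sample1").items()):
            print(folder, len(files), "files")
```

Changing parameters is typically done by pointing IRMA at a different module configuration rather than by editing this command; the exact configuration names depend on the installed IRMA version.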

 

In the last practical of the day, the participants will have time to write their own scripts for collecting QC-relevant parameters from their genomes (fasta-files) and other intermediate files. Specifically, the participants will write scripts to calculate genome-length, number of undetermined bases (Ns), number of ambiguous sites, number of raw reads and number of mapped reads.
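The kind of script participants might end up with can be sketched as follows. This is one possible approach using only the Python standard library, not a reference solution: it assumes four-line FASTQ records, treats the IUPAC codes other than A, C, G, T and N as ambiguous sites, and reads the number of mapped reads from the plain-text output of `samtools flagstat` rather than from the bam-file directly.

```python
import gzip
import re

# IUPAC ambiguity codes, excluding N (counted separately as undetermined).
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def fasta_stats(path):
    """Return {sequence name: {'length', 'n_count', 'ambiguous'}} for a fasta-file."""
    stats = {}
    name = None
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                stats[name] = {"length": 0, "n_count": 0, "ambiguous": 0}
            elif name is not None:
                seq = line.upper()
                stats[name]["length"] += len(seq)
                stats[name]["n_count"] += seq.count("N")
                stats[name]["ambiguous"] += sum(b in IUPAC_AMBIGUOUS for b in seq)
    return stats


def count_fastq_reads(path):
    """Count reads in a fastq-file, assuming exactly four lines per record."""
    with open_text(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def flagstat_count(text, label="mapped"):
    """Extract the QC-passed count for one line of `samtools flagstat` output."""
    pattern = re.compile(
        r"^(\d+) \+ (\d+) " + re.escape(label) + r"( \(|$)", re.MULTILINE
    )
    match = pattern.search(text)
    if match is None:
        raise ValueError(f"no '{label}' line found in flagstat output")
    return int(match.group(1))
```

For paired-end data, the raw read count is the sum over both fastq-files. The flagstat "mapped" line counts alignments, so secondary and supplementary alignments are included unless they were filtered out beforehand.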


Last modified: Monday, 25 March 2024, 11:03 AM