16S Microbial Analysis with mothur (short)
Overview
question Questionsobjectives Objectives
What is the effect of normal variation in the gut microbiome on host health?
Analyze of 16S rRNA sequencing data using the mothur toolsuite in Galaxy
Using a mock community to assess the error rate of your sequencing experiment
Visualize sample diversity using Krona and Phinch
Supporting Materials
Overview
In this tutorial we will perform an analysis based on the Standard Operating Procedure (SOP) for MiSeq data, developed by the Schloss lab, the creators of the mothur software package Schloss et al. 2009.
comment Note: Two versions of this tutorial
Because this tutorial consists of many steps, we have made two versions of it, one long and one short.
This is the shortened version. Instead of running each tool individually, we will employ workflows to run groups of analysis steps (e.g. data cleaning) at once. If you would like more in-depth discussion of each step, please see the longer version of tutorial
You can also switch between the long and short version at the start of any section.
Agenda
In this tutorial, we will cover:
comment Note: results may vary
Your results may be slightly different from the ones presented in this tutorial due to differing versions of tools, reference data, external databases, or because of stochastic processes in the algorithms.
Obtaining and preparing data
In this tutorial we use 16S rRNA data, but similar pipelines can be used for WGS data.
comment Background: The 16S ribosomal RNA gene
The 16S rRNA gene has several properties that make it ideally suited for our purposes
- Present in all prokaryotes
- Highly conserved + highly variable regions
- Huge reference databases
The highly conserved regions make it easy to target the gene across different organisms, while the highly variable regions allow us to distinguish between different species.
(slide credit https://www.slideshare.net/beiko/ccbc-tutorial-beiko)
Understanding our input data
In this tutorial we use the dataset generated by the Schloss lab to illustrate their MiSeq SOP.
They describe the experiment as follows:
“The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end, we collected fresh feces from mice on a daily basis for 365 days post weaning. During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150.”
To speed up analysis for this tutorial, we will use only a subset of
this data. We will look at a single mouse at 10 different
time points (5 early, 5 late). In order to assess the error rate of the
analysis pipeline and experimental setup, the Schloss lab
additionally sequenced a mock community with a known composition
(genomic DNA from 21 bacterial strains). The sequences used
for this mock sample are contained in the file HMP_MOCK.v35.fasta
comment Dataset naming scheme
For this tutorial, you are given 10 pairs of files. For example, the following pair of files:
F3D0_S188_L001_R1_001.fastq
F3D0_S188_L001_R2_001.fastq
The first part of the file name indicates the sample;
F3D0
here signifies that this sample was obtained from Female 3 on Day 0. The rest of the file name is identical, except for_R1
and_R2
, this is used to indicate the forward and reverse reads respectively.
Importing the data into Galaxy
Now that we know what our input data is, let’s get it into our Galaxy history:
All data required for this tutorial has been made available from Zenodo
hands_on Hands-on: Obtaining our data
Make sure you have an empty analysis history. Give it a name.
tip Tip: Creating a new history
Click the new-history icon at the top of the history panel
If the new-history is missing:
- Click on the galaxy-gear icon (History options) on the top of the history panel
- Select the option Create New from the menu
- Import Sample Data.
- Import the sample FASTQ files to your history, either from a shared data library (if available), or from Zenodo using the URLs listed in the box below (click param-repeat to expand):
solution List of Zenodo URLs
https://zenodo.org/record/800651/files/F3D0_R1.fastq https://zenodo.org/record/800651/files/F3D0_R2.fastq https://zenodo.org/record/800651/files/F3D141_R1.fastq https://zenodo.org/record/800651/files/F3D141_R2.fastq https://zenodo.org/record/800651/files/F3D142_R1.fastq https://zenodo.org/record/800651/files/F3D142_R2.fastq https://zenodo.org/record/800651/files/F3D143_R1.fastq https://zenodo.org/record/800651/files/F3D143_R2.fastq https://zenodo.org/record/800651/files/F3D144_R1.fastq https://zenodo.org/record/800651/files/F3D144_R2.fastq https://zenodo.org/record/800651/files/F3D145_R1.fastq https://zenodo.org/record/800651/files/F3D145_R2.fastq https://zenodo.org/record/800651/files/F3D146_R1.fastq https://zenodo.org/record/800651/files/F3D146_R2.fastq https://zenodo.org/record/800651/files/F3D147_R1.fastq https://zenodo.org/record/800651/files/F3D147_R2.fastq https://zenodo.org/record/800651/files/F3D148_R1.fastq https://zenodo.org/record/800651/files/F3D148_R2.fastq https://zenodo.org/record/800651/files/F3D149_R1.fastq https://zenodo.org/record/800651/files/F3D149_R2.fastq https://zenodo.org/record/800651/files/F3D150_R1.fastq https://zenodo.org/record/800651/files/F3D150_R2.fastq https://zenodo.org/record/800651/files/F3D1_R1.fastq https://zenodo.org/record/800651/files/F3D1_R2.fastq https://zenodo.org/record/800651/files/F3D2_R1.fastq https://zenodo.org/record/800651/files/F3D2_R2.fastq https://zenodo.org/record/800651/files/F3D3_R1.fastq https://zenodo.org/record/800651/files/F3D3_R2.fastq https://zenodo.org/record/800651/files/F3D5_R1.fastq https://zenodo.org/record/800651/files/F3D5_R2.fastq https://zenodo.org/record/800651/files/F3D6_R1.fastq https://zenodo.org/record/800651/files/F3D6_R2.fastq https://zenodo.org/record/800651/files/F3D7_R1.fastq https://zenodo.org/record/800651/files/F3D7_R2.fastq https://zenodo.org/record/800651/files/F3D8_R1.fastq https://zenodo.org/record/800651/files/F3D8_R2.fastq https://zenodo.org/record/800651/files/F3D9_R1.fastq https://zenodo.org/record/800651/files/F3D9_R2.fastq https://zenodo.org/record/800651/files/Mock_R1.fastq https://zenodo.org/record/800651/files/Mock_R2.fastq
tip Tip: Importing data via links
- Copy the link location
Open the Galaxy Upload Manager (galaxy-upload on the top-left of the tool panel)
- Select Paste/Fetch Data
Paste the link into the text field
Press Start
- Close the window
By default, Galaxy uses the URL as the name, so rename the files with a more useful name.
tip Tip: Importing data from a data library
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
Go into Shared data (top panel) then Data libraries
Find the correct folder (ask your instructor)
- Select the desired files
- Click on the To History button near the top and select as Datasets from the dropdown menu
- In the pop-up window, select the history you want to import the files to (or create a new one)
- Click on Import
- Import Reference Data
- Import the following reference datasets
silva.v4.fasta
HMP_MOCK.v35.fasta
trainset9_032012.pds.fasta
trainset9_032012.pds.tax
solution List of Zenodo URLs
https://zenodo.org/record/800651/files/HMP_MOCK.v35.fasta https://zenodo.org/record/800651/files/silva.v4.fasta https://zenodo.org/record/800651/files/trainset9_032012.pds.fasta https://zenodo.org/record/800651/files/trainset9_032012.pds.tax https://zenodo.org/record/800651/files/mouse.dpw.metadata
Now that’s a lot of files to manage. Luckily Galaxy can make life a bit easier by allowing us to create dataset collections. This enables us to easily run tools on multiple datasets at once.
Since we have paired-end data, each sample consist of two separate fastq files, one containing the
forward reads, and one containing the reverse reads. We can recognize the pairing from the file names,
which will differ only by _R1
or _R2
in the filename. We can tell Galaxy about this paired naming
convention, so that our tools will know which files belong together. We do this by building a List of Dataset Pairs
hands_on Hands-on: Organizing our data into a paired collection
Click on the checkmark icon param-check at top of your history.
- Select all the FASTQ files (40 in total)
- Tip: type
fastq
in the search bar at the top of your history to filter only the FASTQ files; you can now use theAll
button at the top instead of having to individually select all 40 input files.- Click on for all selected..
- Select Build List of Dataset Pairs from the dropdown menu
In the next dialog window you can create the list of pairs. By default Galaxy will look for pairs of files that differ only by a
_1
and_2
part in their names. In our case however, these should be_R1
and_R2
.- Change these values accordingly
- Change
_1
to_R1
in the text field on the top left- Change
_2
to_R2
om the text field on the top rightYou should now see a list of pairs suggested by Galaxy:
Click on auto-pair to create the suggested pairs.
- Name the pairs
- The middle segment is the name for each pair.
- These names will be used as sample names in the downstream analysis, so always make sure they are informative!
- Check that the pairs are named
F3D0
-F3D9
,F3D141
-F3D150
andMock
.
- If needed, the names can be edited by clicking on them
- Important: On older Galaxy versions, if the files were imported via URL, they may have the full URL as sample name; please remove everything except the sample name for each pair!
- Name your collection at the bottom right of the screen
- Click the Create List button. A new dataset collection item will now appear in your history
Quality Control
exchange Switch to extended tutorial
For more information on the topic of quality control, please see our training materials here.
Before starting any analysis, it is always a good idea to assess the quality of your input data and improve it where possible by trimming and filtering reads. The mothur toolsuite contains several tools to assist with this task. We will begin by merging our reads into contigs, followed by filtering and trimming of reads based on quality score and several other metrics.
Create contigs from paired-end reads
In this experiment, paired-end sequencing of the ~253 bp V4 region of the 16S rRNA gene was performed. The sequencing was done from either end of each fragment. Because the reads are about 250 bp in length, this results in a significant overlap between the forward and reverse reads in each pair. We will combine these pairs of reads into contigs.
The Make.contigs tool creates the contigs, and uses the paired collection as input. Make.contigs will look at each pair, take the reverse complement reverse read, and then determine the overlap between the two sequences. Where an overlapping base call differs between the two reads, the quality score is used to determine the consensus base call. A new quality score is derived by combining the two original quality scores in both of the reads for all the overlapping positions.
hands_on Hands-on: Combine forward and reverse reads into contigs
- Make.contigs tool with the following parameters
- param-select “Way to provide files”:
Multiple pairs - Combo mode
- param-collection “Fastq pairs”: the collection you just created
- Leave all other parameters to the default settings
This step combined the forward and reverse reads for each sample, and also combined the resulting contigs from all samples into a single file. So we have gone from a paired collection of 20x2 FASTQ files, to a single FASTA file. In order to retain information about which reads originated from which samples, the tool also output a group file. View that file now, it should look something like this:
M00967_43_000000000-A3JHG_1_1101_10011_3881 F3D0
M00967_43_000000000-A3JHG_1_1101_10050_15564 F3D0
M00967_43_000000000-A3JHG_1_1101_10051_26098 F3D0
[..]
Here the first column contains the read name, and the second column contains the sample name.
Data Cleaning
Next, we want to improve the quality of our data. To this end we will run a workflow that performs the following steps:
- Filter by length
We know that the V4 region of the 16S gene is around 250 bp long. Anything significantly longer was likely a poorly assembled contig. We will remove any contigs longer than 275 base pairs using the Screen.seqs tool tool. - Remove low quality contigs
We will also remove any contigs containing too many ambiguous base calls. - Deduplicate sequences
Since we are sequencing many of the same organisms, there will likely be many identical contigs. To speed up downstream analysis we will determine the set of unique contigs using Unique.seqs tool.
hands_on Hands-on: Perform data cleaning
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 1: Quality Control workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Contigs”: the
trim.contigs.fasta
output from Make.contigs tool- param-file “2: Groups”: the
group file
from Make.contigs tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
question Questions
- How many sequences were removed in the screening step?
- How many unique sequences are there in our cleaned dataset?
solution Solutions
The screening removed 23,488 sequences.
This can be determined by looking at the number of lines in
bad.accnos
output of Screen.seqs tool or by comparing the total number of sequences before and after this screening step in the output of Summary.seqs toolThere are 16,426 unique sequences.
This can be determined by expanding one of the outputs of Unique.seqs tool and looking at the number of lines in the file.
Have a look at the count_table output from the Count.seqs tool, it summarizes the number of times each unique sequence was observed across each of the samples. It will look something like this:
Representative_Sequence total F3D0 F3D1 F3D141 F3D142 ...
M00967_43_000000000-A3JHG_1_1101_14069_1827 4402 370 29 257 142
M00967_43_000000000-A3JHG_1_1101_18044_1900 28 1 0 1 0
M00967_43_000000000-A3JHG_1_1101_13234_1983 10522 425 281 340 205
...
The first column contains the read names of the representative sequences, and the subsequent columns contain the number of duplicates of this sequence observed in each sample.
comment Representative sequences vs Total sequences
From now on, we will only work with the set of unique sequences, but it’s important to remember that these represent a larger number of total sequences, which we keep track of in the count table.
The Summary.seqs tool tool will report both the number of unique representative sequences as well as the total sequences they represent.
Sequence Alignment
exchange Switch to extended tutorial
For more information on the topic of alignment, please see our training materials here
We are now ready to align our sequences to the reference. This is an important step to improve the clustering of your OTUs Schloss 2012.
hands_on Hands-on: Align sequences
- Align.seqs tool with the following parameters
- param-file “fasta”: the
fasta
output from Unique.seqs tool- param-file “reference”:
silva.v4.fasta
reference file from your historyquestion Question
Have a look at the alignment output, what do you see?
solution Solution
At first glance, it might look like there is not much information there. We see our read names, but only period
.
characters below it.>M00967_43_000000000-A3JHG_1_1101_14069_1827 ............................................................................ >M00967_43_000000000-A3JHG_1_1101_18044_1900 ............................................................................
This is because the V4 region is located further down our reference database and nothing aligns to the start of it. If you scroll to right you will start seeing some more informative bits:
.....T-------AC---GG-AG-GAT------------
Here we start seeing how our sequences align to the reference database. There are different alignment characters in this output:
.
: terminal gap character (before the first or after the last base in our query sequence)-
: gap character within the query sequenceWe will cut out only the V4 region in a later step (Filter.seqs tool)
- Summary.seqs tool with the following parameters:
- param-file “fasta”: the
align
output from Align.seqs tool- param-file “count”:
count_table
output from Count.seqs tool- “Output logfile?”:
yes
Have a look at the summary output (log file):
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1250 10693 250 0 3 1
2.5%-tile: 1968 11550 252 0 3 3222
25%-tile: 1968 11550 252 0 4 32219
Median: 1968 11550 252 0 4 64437
75%-tile: 1968 11550 253 0 5 96655
97.5%-tile: 1968 11550 253 0 6 125651
Maximum: 1982 13400 270 0 12 128872
Mean: 1967.99 11550 252.462 0 4.36693
# of unique seqs: 16426
total # of seqs: 128872
The Start
and End
columns tell us that the majority of reads aligned between positions 1968 and 11550,
which is what we expect to find given the reference file we used. However, some reads align to very different positions,
which could indicate insertions or deletions at the terminal ends of the alignments or other complicating factors.
Also notice the Polymer
column in the output table. This indicates the average homopolymer length. Since we know that
our reference database does not contain any homopolymer stretches longer than 8 reads, any reads containing such
long stretches are likely the result of PCR errors and we would be wise to remove them.
Next we will clean our data further by removing poorly aligned sequences and any sequences with long homopolymer stretches.
More Data Cleaning
To ensure that all our reads overlap our region of interest, we will:
- Remove any reads not overlapping the region V4 region using Screen.seqs tool.
- Remove any overhang on either end of the V4 region to ensure our sequences overlap only the V4 region, using Filter.seqs tool.
- Clean our alignment file by removing any columns that have a gap character (
-
, or.
for terminal gaps) at that position in every sequence (also using Filter.seqs tool). - Group near-identical sequences together with Pre.cluster tool. Sequences that only differ by one or two bases at this point are likely to represent sequencing errors rather than true biological variation, so we will cluster such sequences together.
- Remove Sequencing artefacts known as chimeras (discussed in next section).
Chimera Removal
During PCR amplification, it is possible that two unrelated templates are combined to form a sort of hybrid sequence,
also called a chimera. Needless to say, we do not want such sequencing artefacts confounding our results. We’ll do
this chimera removal using the VSEARCH
algorithm Rognes et al. 2016 that is called within mothur, using the
Chimera.vsearch tool tool.
comment Background: Chimeras
(slide credit: http://slideplayer.com/slide/4559004/ )
hands_on Hands-on: Clean Aligned sequences and Chimera Removal
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 2: Data Cleaning and Chimera Removal workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Aligned Sequences”: the
align
output from Align.seqs tool- param-file “2: Count Table”: the
count table
from Count.seqs tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
question Question
- How many chimeric sequences were detected?
- How many sequences remain after these cleaning steps?
solution Solution
There were 3,439 representative sequences flagged as chimeric. These represent a total of 10,564 total sequences
This can be determined by looking at the number of sequences in the
vsearch.accnos
file (3439). To determine how many total sequences these represent, compare the Summary.seqs log output files before and after the chimera filtering step (128,655-118,091=10,564).There are 2,281 remaining sequences after filtering, clustering of highly similar sequences, and chimera removal.
This can be determined by looking at the number of sequences in the fasta output of Remove.seqs tool
Have a look at the FASTA output from Pre.cluster, it should looks something like this:
>M00967_43_000000000-A3JHG_1_1101_13234_1983
TAC--GG-AG-GAT--GCG-A-G-C-G-T-T--AT-C-CGG-AT--TT-A-T-T--GG-GT--TT-A-AA-GG-GT-GC-G-CA-GGC-G-G-A-AG-A-T-C-AA-G-T-C-A-G-C-G-G--TA-A-AA-TT-G-A-GA-GG--CT-C-AA-C-C-T-C-T-T-C--GA-G-C-CGTT-GAAAC-TG-G-TTTTC-TTGA-GT-GA-GC-GA-G-A---AG-T-A-TGCGGAATGCGTGGTGT-AGCGGT-GAAATGCATAG-AT-A-TC-AC-GC-AG-AACTCCGAT-TGCGAAGGCA------GC-ATA-CCG-G-CG-CT-C-A-ACTGACG-CTCA-TGCA-CGAAA-GTG-TGGGT-ATC-GAACAGG
>M00967_43_000000000-A3JHG_1_1101_14069_1827
TAC--GG-AG-GAT--GCG-A-G-C-G-T-T--AT-C-CGG-AT--TT-A-T-T--GG-GT--TT-A-AA-GG-GT-GC-G-TA-GGC-G-G-C-CT-G-C-C-AA-G-T-C-A-G-C-G-G--TA-A-AA-TT-G-C-GG-GG--CT-C-AA-C-C-C-C-G-T-A--CA-G-C-CGTT-GAAAC-TG-C-CGGGC-TCGA-GT-GG-GC-GA-G-A---AG-T-A-TGCGGAATGCGTGGTGT-AGCGGT-GAAATGCATAG-AT-A-TC-AC-GC-AG-AACCCCGAT-TGCGAAGGCA------GC-ATA-CCG-G-CG-CC-C-T-ACTGACG-CTGA-GGCA-CGAAA-GTG-CGGGG-ATC-AAACAGG
We see that these are our contigs, but with extra alignment information. The filtering steps have removed any positions which had a gap symbol in all reads of the dataset.
Taxonomic Classification
exchange Switch to extended tutorial
Now that we have thoroughly cleaned our data, we are finally ready to assign a taxonomy to our sequences.
We will do this using a Bayesian classifier (via the Classify.seqs tool tool) and a mothur-formatted training set provided by the Schloss lab based on the RDP (Ribosomal Database Project, Cole et al. 2013) reference taxonomy.
comment Background: Taxonomic assignment
In this tutorial we will use the RDP classifier and reference taxonomy for classification, but there are several different taxonomic assignment algorithms and reference databases available for this purpose.
An overview of different methods is given by Liu et al. 2008 and shown below:
The most commonly used reference taxonomies are:
- SILVA (Quast et al. 2012)
- GreenGenes (DeSantis et al. 2006)
- RDP (Cole et al. 2013)
- NCBI Taxonomy Database (Federhen 2011)
The choice of taxonomic classifier and reference taxonomy can impact downstream results. The figure from Liu et al. 2008 given below shows the taxonomic composition determined when using different classifiers and reference taxonomies, for different primer sets (16S regions).
Which reference taxonomy is best for your experiments depends on a number of factors such as the type of sample and variable region sequenced.
Another discussion about how these different databases compare was described by Balvočiūtė and Huson 2017.
Removal of non-bacterial sequences
Despite all we have done to improve data quality, there may still be more to do: there may be 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria that have survived all the cleaning steps up to this point. We are generally not interested in these sequences and want to remove them from our dataset.
hands_on Hands-on: Taxonomic Classification and removal of non-bacterial sequences
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 3: Classification workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Cleaned sequences”: the
fasta
output from Remove.seqs tool- param-file “2: Count Table”: the
count table
from Remove.seqs tool- param-file “3: Training set Taxonomy”:
trainset9_032012.pds.tax
file you imported from Zenodo- param-file “4: Training set FASTA”:
trainset9_032012.pds.fasta
file from Zenodotip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
question Questions
How many non-bacterial sequences were removed? Determine both the number of representative sequences and total sequences removed.
solution Solution
There were 20 representative sequences removed, representing 162 total sequences. This can be determined by looking at the summary.seqs log outputs before the Remove.lineage step (after chimera removal), and after.
The data is now as clean as we can get it. In the next section we will use the Mock sample to assess how accurate our sequencing and bioinformatics pipeline is.
Optional: Calculate error rates based on our mock community
exchange Switch to extended tutorial
comment Skipping the mock community analysis
The mock community analysis is optional. If you are low on time or want to skip ahead, you can jump straight to the next section where we will cluster our sequences into OTUs, classify them and perform some visualisations.
Click here to skip this section and continue with the analysis.
The following step is only possible if you have co-sequenced a mock community with your samples. A mock community is a sample of which you know the exact composition and is something we recommend to do, because it will give you an idea of how accurate your sequencing and analysis protocol is.
comment Background: Mock communities
What is a mock community?
A mock community is an artificially constructed sample; a defined mixture of microbial cells and/or viruses or nucleic acid molecules created in vitro to simulate the composition of a microbiome sample or the nucleic acid isolated therefrom.
Why sequence a mock community?
In a mock community, we know exactly which sequences/organisms we expect to find, and at which proportions. Therefore, we can use such an artificial sample to assess the error rates of our sequencing and analysis pipeline.
- Did we miss any of the sequences we know to be present in the sample (false negatives)?
- Do we find any sequences that were not present in the sample (false positives)?
- Were we able to accurately detect their relative abundances?
If our workflow performed well on the mock sample, we have more confidence in the accuracy of the results on the rest of our samples.
Example
As an example, consider the following image from Fouhy et al Fouhy et al. 2016. A mock community sample was sequenced for different combinations of sequencer and primer sets (V-regions). Since we know the expected outcome, we can assess the accuracy of each pipeline. A similar approach can be used to assess different parameter settings of the in-silico analysis pipeline.
Further reading
- Next generation sequencing data of a defined microbial mock community Singer et al. 2016
- 16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform Fouhy et al. 2016
The mock community in this experiment was composed of genomic DNA from 21 bacterial strains. So in a perfect world, this is exactly what we would expect the analysis to produce as a result.
First, let’s extract the sequences belonging to our mock samples from our data:
hands_on Hands-on: extract mock sample from our dataset
- Get.groups tool with the following parameters
- param-file “group file or count table”: the
count table
from Remove.lineage tool- param-select “groups”:
Mock
- param-file “fasta”:
fasta
output from Remove.lineage tool- param-check “output logfile?”:
yes
In the log file we see the following:
Selected 58 sequences from your fasta file.
Selected 4046 sequences from your count file
The Mock sample has 58 unique sequences, representing a total of 4,046 total sequences.
The Seq.error tool measures the error rates using our mock reference. Here we align the reads from our mock sample back to their known sequences, to see how many fail to match.
hands_on Hands-on: Assess error rates based on a mock community
- Seq.error tool with the following parameters
- param-file “pick.fasta”: the
fasta
output from Get.groups tool- param-file “reference”:
HMP_MOCK.v35.fasta
file from your history- param-file “count”: the
count table
from Get.groups tool- param-check “output log?”:
yes
In the log file we see something like this:
Overall error rate: 6.5108e-05
Errors Sequences
0 3998
1 3
2 0
3 2
4 1
[..]
That is pretty good! The error rate is only 0.0065%! This gives us confidence that the rest of the samples are also of high quality, and we can continue with our analysis.
Cluster mock sequences into OTUs
We will now estimate the accuracy of our sequencing and analysis pipeline by clustering the Mock sequences into OTUs, and comparing the results with the expected outcome.
comment Background: What are Operational Taxonomic Units (OTUs)?
In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence. Each of these clusters is intended to represent a taxonomic unit of a bacteria species or genus depending on the sequence similarity threshold. Typically, OTU cluster are defined by a 97% identity threshold of the 16S gene sequence variants at species level. 98% or 99% identity is suggested for strain separation.
(Image credit: Danzeisen et al. 2013, 10.7717/peerj.237)
hands_on Hands-on: Cluster mock sequences into OTUs
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 4: Mock OTU Clustering workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Mock Count Table”: the
count table
output from Get.groups tool- param-file “2: Mock Sequences”: the
fasta
output from Get.groups tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
question Question
How many OTUs were identified in our mock community?
solution Solution
Answer: 34
This can be determined by opening the shared file or OTU list and looking at the header line. You will see a column for each OTU
Open the rarefaction output (dataset named sobs
inside the rarefaction curves
output collection), it should look
something like this:
numsampled 0.03- lci- hci-
1 1.0000 1.0000 1.0000
100 18.0240 16.0000 20.0000
200 19.2160 17.0000 22.0000
300 19.8800 18.0000 22.0000
400 20.3600 19.0000 22.0000
[..]
3000 30.4320 28.0000 33.0000
3100 30.8800 28.0000 34.0000
3200 31.3200 29.0000 34.0000
3300 31.6320 29.0000 34.0000
3400 31.9920 30.0000 34.0000
3500 32.3440 30.0000 34.0000
3600 32.6560 31.0000 34.0000
3700 32.9920 31.0000 34.0000
3800 33.2880 32.0000 34.0000
3900 33.5920 32.0000 34.0000
4000 33.8560 33.0000 34.0000
4046 34.0000 34.0000 34.0000
When we use the full set of 4060 sequences, we find 34 OTUs from the Mock community; and with 3000 sequences, we find about 31 OTUs. In an ideal world, we would find exactly 21 OTUs. Despite our best efforts, some chimeras or other contaminations may have slipped through our filtering steps.
comment Background: Rarefaction
To estimate the fraction of species sequenced, rarefaction curves are typically used. A rarefaction curve plots the number of species as a function of the number of individuals sampled. The curve usually begins with a steep slope, which at some point begins to flatten as fewer species are being discovered per sample: the gentler the slope, the less contribution of the sampling to the total number of operational taxonomic units or OTUs.
Green, most or all species have been sampled; blue, this habitat has not been exhaustively sampled; red, species rich habitat, only a small fraction has been sampled.
(A Primer on Metagenomics Wooley et al. 2010 )
Now that we have assessed our error rates we are ready for some real analysis.
OTU Clustering
exchange Switch to extended tutorial
In this tutorial we will continue with an OTU-based approach, for the phylotype and phylogenic approaches, please refer to the mothur wiki page.
comment Background: What are Operational Taxonomic Units (OTUs)?
In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence. Each of these clusters is intended to represent a taxonomic unit of a bacteria species or genus depending on the sequence similarity threshold. Typically, OTU cluster are defined by a 97% identity threshold of the 16S gene sequence variants at species level. 98% or 99% identity is suggested for strain separation.
(Image credit: Danzeisen et al. 2013, 10.7717/peerj.237)
We will now repeat the OTU clustering we performed on our mock community for our real datasets. We use a slightly different workflow because these tools are faster for larger datasets. We will also normalize our data by subsampling to the level of the sample with the lowest number of sequences in it.
hands_on Hands-on: Cluster our data into OTUs
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 5: OTU Clustering workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Sequences”: the
fasta
output from Remove.lineage tool- param-file “2: Count table”: the
count table
output from Remove.lineage tool- param-file “3: Taxonomy”: the
taxonomy
output from Remove.lineage tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
Examine galaxy-eye the taxonomy
output of Classify.otu tool.
This is a collection, and the different levels of taxonomy are shown in
the names of the collection elements. In this example we only
calculated one level, 0.03. This means we used a 97% similarity
threshold. This threshold is commonly used to differentiate at species level.
Opening the taxonomy output for level 0.03 (meaning 97% similarity, or species level) shows a file structured like the following:
OTU Size Taxonomy
..
Otu0008 5260 Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Rikenellaceae"(100);Alistipes(100);
Otu0009 3613 Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0010 3058 Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillaceae(100);Lactobacillus(100);
Otu0011 2958 Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0012 2134 Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0013 1856 Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillaceae(100);Lactobacillus(100);
..
The first line shown in the snippet above indicates that Otu008 occurred 5260 times, and that all of the sequences (100%) were binned in the genus Alistipes.
question Question
Which samples contained sequences belonging to an OTU classified as Staphylococcus?
solution Solution
Examine the
tax.summary
file output by Classify.otu tool.Samples F3D141, F3D142, F3D144, F3D145, F3D2. This answer can be found by examining the tax.summary output and finding the columns with nonzero values for the line of Staphylococcus
Before we continue, let’s remind ourselves what we set out to do. Our original question was about the stability of the microbiome and whether we could observe any change in community structure between the early and late samples.
Diversity Analysis
exchange Switch to extended tutorial
Species diversity is a valuable tool for describing the ecological complexity of a single sample (alpha diversity) or between samples (beta diversity). However, diversity is not a physical quantity that can be measured directly, and many different metrics have been proposed to quantify diversity by Finotello et al. 2016.
comment Background: Species Diversity
Species diversity consists of three components: species richness, taxonomic or phylogenetic diversity and species evenness.
- Species richness = the number of different species in a community.
- Species evenness = how even in numbers each species in a community is.
- Phylogenetic diversity = how closely related the species in a community are.
Each of these factors play a role in diversity, but how to combine them into a single measure of diversity is nontrivial. Many different metrics have been proposed for this, for example: shannon, chao, pd, ace, simpson, sobs, jack, npshannon, smithwilson, heip bergerparker, boney, efron, shen, solow, bootstrap, qstat, coverage, anderberg, hamming, jclass, jest, ochiai, canberra, thetayc, invsimpson, just to name a few ;). A comparison of several different diversity metrics is discussed in Bonilla-Rosso et al. 2012
question Question
To understand the difference between richness and evenness, consider the following example:
- Which of these communities has the highest richness?
- Which of these communities has the highest evenness?
solution Solution
- Both communities have 4 different species, so they have same richness.
- Community B is more even, because each species has the same abundance.
Even when two samples have identical richness and evenness, we still may conclude that one is more diverse than the other if the species are very dissimilar in one of the samples (have high phylogenetic distance), but very closely related to each other in the second sample.
Now, you do not need to know what all these different metrics are, but just remember that there is not a single definition of diversity and as always, the metric you choose to use may influence your results.
Alpha diversity
In order to estimate alpha diversity of the samples, we first generate the rarefaction curves. Recall that rarefaction measures the number of observed OTUs as a function of the subsampling size.
comment Background: Rarefaction
To estimate the fraction of species sequenced, rarefaction curves are typically used. A rarefaction curve plots the number of species as a function of the number of individuals sampled. The curve usually begins with a steep slope, which at some point begins to flatten as fewer species are being discovered per sample: the gentler the slope, the less contribution of the sampling to the total number of operational taxonomic units or OTUs.
Green, most or all species have been sampled; blue, this habitat has not been exhaustively sampled; red, species rich habitat, only a small fraction has been sampled.
(A Primer on Metagenomics Wooley et al. 2010 )
We will use a plotting tool to visualize the rarefaction curves, and use Summary.single tool to calculate a number of different alpha diversity metrics on all our samples.
hands_on Hands-on: Alpha Diversity
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 6: Alpha Diversity workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Shared File”: the
Shared file
output from Make.shared tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
View the rarefaction plot output. From this image can see that the rarefaction curves for all samples have started to level off so we are confident we cover a large part of our sample diversity:
View the summary
output from Summary.single tool. This shows several alpha diversity metrics:
- sobs: observed richness (number of OTUs)
- coverage: Good’s coverage index
- invsimpson: Inverse Simpson Index
- nseqs: number of sequences
label group sobs coverage invsimpson invsimpson_lci invsimpson_hci nseqs
0.03 F3D0 167.000000 0.994697 25.686387 24.648040 26.816067 6223.000000
0.03 F3D1 145.000000 0.994030 34.598470 33.062155 36.284520 4690.000000
0.03 F3D141 154.000000 0.991060 19.571632 18.839994 20.362390 4698.000000
0.03 F3D142 141.000000 0.978367 17.029921 16.196090 17.954269 2450.000000
0.03 F3D143 135.000000 0.980738 18.643635 17.593785 19.826728 2440.000000
0.03 F3D144 161.000000 0.980841 15.296728 14.669208 15.980336 3497.000000
0.03 F3D145 169.000000 0.991222 14.927279 14.494740 15.386427 5582.000000
0.03 F3D146 161.000000 0.989167 22.266620 21.201364 23.444586 3877.000000
0.03 F3D147 210.000000 0.995645 15.894802 15.535594 16.271013 12628.000000
0.03 F3D148 176.000000 0.995725 17.788205 17.303206 18.301177 9590.000000
0.03 F3D149 194.000000 0.994957 21.841083 21.280343 22.432174 10114.000000
0.03 F3D150 164.000000 0.989446 23.553161 22.462533 24.755101 4169.000000
0.03 F3D2 179.000000 0.998162 15.186238 14.703161 15.702137 15774.000000
0.03 F3D3 127.000000 0.994167 14.730640 14.180453 15.325243 5315.000000
0.03 F3D5 138.000000 0.990523 29.415378 28.004777 30.975621 3482.000000
0.03 F3D6 155.000000 0.995339 17.732145 17.056822 18.463148 6437.000000
0.03 F3D7 126.000000 0.991916 13.343631 12.831289 13.898588 4082.000000
0.03 F3D8 158.000000 0.992536 23.063894 21.843396 24.428855 4287.000000
0.03 F3D9 162.000000 0.994803 24.120541 23.105499 25.228865 5773.000000
There are a couple of things to note here:
- The differences in diversity and richness between early and late time points is small.
- All sample coverage is above 97%.
There are many more diversity metrics, and for more information about the different calculators available in mothur, see the mothur wiki page
We could perform additional statistical tests (e.g. ANOVA) to confirm our feeling that there is no significant difference based on sex or early vs. late, but this is beyond the scope of this tutorial.
Beta diversity
Beta diversity is a measure of the similarity of the membership and structure found between different samples.
The default calculator in the following section is thetaYC, which is the Yue & Clayton theta similarity
coefficient. We will also calculate the Jaccard index (termed jclass
in mothur).
In the following workflow we will:
- Calculate pairwise distances between samples using the thetaYC calculator (Dist.shared tool)
- Create a Venn diagram to show the number of overlapping OTUs between 4 of our samples
- Create a heatmap of the intersample similarities (Heatmap.sim tool)
- Create pylogenetic tree showing the relatedness of samples (Newick Display tool)
hands_on Hands-on: Beta Diversity
- Import the workflow into Galaxy
- Copy the URL (e.g. via right-click) of this workflow or download it to your computer.
- Import the workflow into Galaxy
tip Tip: Importing a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the upload icon galaxy-upload at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the worflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
- Run Workflow 7: Beta Diversity workflow using the following parameters:
- “Send results to a new history”:
No
- param-file “1: Shared File”: the
Shared file
output from Make.shared tool- param-collection “2: Subsample shared”: the
shared
output from Sub.sample tooltip Tip: Running a workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the dropdown menu galaxy-dropdown next to your workflow
- Select Run from the list
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
Look at some of the resulting heatmaps (you may have to download the SVG images first). In all of these heatmaps the red colors indicate communities that are more similar than those with black colors.
For example this is the heatmap for the thetayc
calculator (output thetayc.0.03.lt.ave
):
and the jclass calulator (output jclass.0.03.lt.ave
):
Examine the Venn diagram:
This shows that there were a total of 180 OTUs observed between the 4 time points. Only 76 of those OTUs were shared by all four time points. We could look deeper at the shared file to see whether those OTUs were numerically rare or just had a low incidence.
Inspection of the the tree shows that the early and late communities cluster with themselves to the exclusion of the others.
thetayc.0.03.lt.ave
:
jclass.0.03.lt.ave
:
Visualisations
Krona
A tool we can use to visualize the composition of our community, is Krona
hands_on Hands-on: Krona
First we convert our mothur taxonomy file to a format compatible with Krona
- Taxonomy-to-Krona tool with the following parameters
- param-collection “Taxonomy file”: the
taxonomy
output from Classify.otu- Krona pie chart tool with the following parameters
- “Type of input”:
Tabular
- param-collection “Input file”: the
taxonomy
output from Taxonomy-to-Krona tool
The resulting file is an HTML file containing an interactive visualization. For instance try double-clicking the innermost ring labeled “Bacteroidetes” below:
question Question
What percentage of your sample was labelled
Lactobacillus
?solution Solution
Explore the Krona plot, double click on Firmicutes, here you should see Lactobacillus clearly (16% in our case), click on this segment and the right-hand side will show you the percentages at any point in the hierarchy (here 5% of all)
Exercise: generating per-sample Krona plots (Optional)
You may have noticed that this plot shows the results for all samples together. In many cases however, you would like to be able to compare results for different samples.
In order to save computation time, mothur pools all reads into a single file, and uses
the count table
file to keep track of which samples the reads came from. However, Krona
does not understand the mothur count table format, so we cannot use that to supply information
about the groups. But luckily we can get Classify.otu tool to output per-sample
taxonomy files. In the following exercise, we will create a Krona plot with per-sample subplots.
question Exercise: per-sample plots
Try to create per-sample Krona plots. An few hints are given below, and the full answer is given in the solution box.
- Re-run galaxy-refresh the Classify.otu tool tool we ran earlier
- See if you can find a parameter to output a taxonomy file per sample (group)
- Run Taxonomy-to-Krona tool again on the per-sample taxonomy files (collection)
- Run Krona tool
solution Full Solution
- Find the previous run of Classify.otu tool in your history
- Hit the rerun button galaxy-refresh to load the parameters you used before:
- param-file “list”: the
list
output from Cluster.split tool- param-file “count”: the
count table
from Remove.groups tool- param-file “taxonomy”: the
taxonomy
output from Remove.groups tool- “label”:
0.03
- Add new parameter setting:
- “persample - allows you to find a consensus taxonomy for each group”:
Yes
You should now have a collection with per-sample files- Taxonomy-to-Krona tool with the following parameters
- param-collection “Taxonomy file”: the
taxonomy
collection from Classify.otu tool- Krona pie chart tool with the following parameters
- “Type of input”:
Tabular
- param-collection “Input file”: the collection from Taxonomy-to-Krona tool
- “Combine data from multiple datasets?”:
No
The final result should look something like this (switch between samples via the list on the left):
Phinch
We may now wish to further visualize our results. We can convert our shared file to the more widely used biom
format and
view it in a platform like Phinch.
hands_on Hands-on: Phinch
- Make.biom tool with the following parameters
- param-collection “shared”: the output from Sub.sample tool
- param-collection “constaxonomy”: the
taxonomy
output from Classify.otu tool- param-file “metadata”: the
mouse.dpw.metadata
file you uploaded at the start of this tutorial- View the file in Phinch
- If you expand the the output biom dataset, you will see a link to view the file at Phinch
- Click on this link (“view biom at Phinch”)
This link will lead you to a Phinch server (hosted by Galaxy), which will automatically load your file, and where you can several interactive visualisations:
comment No link to Phinch on your dataset?
If this visualisation link is not present on your Galaxy dataset, you can download the generated BIOM file from Galaxy, and upload it directly to the Phinch server at https://usegalaxy.eu/phinch/.
Important: After downloading, please change the file extension from
.biom1
to.biom
before uploading to Phinch.Note: This link will visualize your data in Phinch version 1. Recently, Phinch have released version 2 as a desktop application. This file can also be visualized in Phinch2, but requires installation of Phinch to your local machine
Conclusion
Well done! trophy You have completed the basics of the Schloss lab’s Standard Operating Procedure for Illumina MiSeq data. You have worked your way through the following pipeline:
keypoints Key points
16S rRNA gene sequencing analysis results depend on the many algorithms used and their settings
Quality control and cleaning of your data is a crucial step in order to obtain optimal results
Adding a mock community to serve as a control sample can help you asses the error rate of your experimental setup
We can explore alpha and beta diversities using Krona and Phinch for dynamic visualizations
Useful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
- DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie et al., 2006 Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology 72: 5069–5072. 10.1128/aem.03006-05
- Liu, Z., T. Z. DeSantis, G. L. Andersen, and R. Knight, 2008 Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120–e120. 10.1093/nar/gkn491
- Schloss, P. D., S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann et al., 2009 Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541. 10.1128/aem.01541-09
- Wooley, J. C., A. Godzik, and I. Friedberg, 2010 A Primer on Metagenomics (P. E. Bourne, Ed.). PLoS Computational Biology 6: e1000667. 10.1371/journal.pcbi.1000667
- Federhen, S., 2011 The NCBI Taxonomy database. Nucleic Acids Research 40: D136–D143. 10.1093/nar/gkr1178
- Bonilla-Rosso, G., L. E. Eguiarte, D. Romero, M. Travisano, and V. Souza, 2012 Understanding microbial community diversity metrics derived from metagenomes: performance evaluation using simulated data sets. FEMS Microbiology Ecology 82: 37–49. 10.1111/j.1574-6941.2012.01405.x
- Schloss, P. D., 2012 Secondary structure improves OTU assignments of 16S rRNA gene sequences. The ISME Journal 7: 457–460. 10.1038/ismej.2012.102
- Quast, C., E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer et al., 2012 The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41: D590–D596. 10.1093/nar/gks1219
- Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell et al., 2013 Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research 42: D633–D642. 10.1093/nar/gkt1244
- Fouhy, F., A. G. Clooney, C. Stanton, M. J. Claesson, and P. D. Cotter, 2016 16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform. BMC Microbiology 16: 10.1186/s12866-016-0738-z
- Singer, E., B. Andreopoulos, R. M. Bowers, J. Lee, S. Deshpande et al., 2016 Next generation sequencing data of a defined microbial mock community. Scientific Data 3: 160081. 10.1038/sdata.2016.81
- Rognes, T., T. Flouri, B. Nichols, C. Quince, and F. Mahé, 2016 VSEARCH: a versatile open source tool for metagenomics. PeerJ 4: e2584. 10.7717/peerj.2584
- Finotello, F., E. Mastrorilli, and B. D. Camillo, 2016 Measuring the diversity of the human microbiota with targeted next-generation sequencing. Briefings in Bioinformatics bbw119. 10.1093/bib/bbw119
- Balvočiūtė, M., and D. H. Huson, 2017 SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare? BMC Genomics 18: 10.1186/s12864-017-3501-4
Citing this Tutorial
- Saskia Hiltemann, Bérénice Batut, Dave Clements, 2020 16S Microbial Analysis with mothur (short) (Galaxy Training Materials). /training-material/topics/metagenomics/tutorials/mothur-miseq-sop-short/tutorial.html Online; accessed Wed Sep 02 2020
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012