In this exercise, you will analyze two population (metagenomic) samples
using _breseq to track the frequencies of evolved alleles and
changes in genetic diversity in population Ara-3 of the Lenski long-term
evolution experiment (LTEE). As discussed in tutorial-clones this
population evolved citrate utilization after 31,500 generations.
1. Download data files¶
First, create a directory called tutorial_populations:
$ mkdir tutorial_populations
$ cd tutorial_populations
Reference sequence¶
_breseq prefers the reference sequence in Genbank or GFF3 format. In this example, the reference sequence is the ancestral Escherichia coli B strain REL606. For this tutorial, we are going to use an updated version of the GenBank genome record (accession:NC_012967) that contains richer gene annotation information than the version available from NCBI.
Read files¶
We're going to use Illumina genome re-sequencing data from mixed populations that evolved for up to 40,000 generations in a long-term evolution experiment [Blount2008] [Blount2011]. This data is available in the European Nucleotide Archive (ENA). Go to https://www.ebi.ac.uk/ and search for the accession number: SRR1721884. Then click on the accession number to open the record and download the two FASTQ files using the links in the 'ftp' column.
This particular sample was taken at 20,000 generations from population Ara-3. You'll use it to illustrate running _breseq in polymorphism mode and the consequences of different filtering options for ruling out false-positives. If you would like to access the entire time-series of population samples from this population check out SRP051254.
2. Run _breseq with default filters¶
Check to be sure that you have changed into the tutorial_clonal_samples directory and that you have all of the input files (and have uncompressed them).
$ ls
NC_012967.gbk SRR030257_1.fastq SRR030257_2.fastq
Now, run _breseq using this command:
$ breseq -j 8 -p -o REL8595M-default -r REL606.gbk SRR030257_1.fastq SRR030257_2.fastq
This command is expected to take roughly 30 minutes to an hour to complete.
Open REL8595M-default/output/index.html. Examine the mutation lines
that are highlighted in green, which are predicted to be polymorphic in
the mixed population sample (the predicted allele has an estimated
maximum likelihood frequency between 0% and 100%). Click on some of the
RA and JC links for these items. How are they different from
those that you observed when _breseq was run in consensus mode on a
clone?
3. Run _breseq with no filters¶
Predicting polymorphisms is very prone to false-positives (wrongly predicting genetic variation that is not actually present in a sample). This is, in part, because sequencing data has hotspots and biases that are difficult to adequately capture in a statistical error model (particularly when analyzing only one sample at a time, like in this example). _breseq has some default options that can be used to filter out variant predictions that look suspicious because of certain biases that they exhibit with respect to how reads align to them versus how they typically align to "normal" positions in the genome.
In consensus mode, _breseq results are generally robust to different sequencing coverage depths and types of samples. In polymorphism mode, _breseq often needs some tuning of parameters and statistical cutoffs depending on characteristics of the input data set in order to not predict either too many (false-positives) or too few (false-negatives) polymorphisms. In addition, it may be necessary to perform more complex analyses of multiple samples or of time courses to gain extra power for discriminating true polymorphisms from errors. Unfortunately, these approaches are outside the scope of a simple _breseq command!
Bring up the full _breseq help:
$ breseq -h
The relevant options are listed under Polymorphism (Mixed Population) Options. Now, we're going to do a _breseq run in which we disable all of the filters for comparison to the initial run:
$ breseq -j 8 -p --polymorphism-reject-indel-homopolymer-length 0 --polymorphism-reject-surrounding-homopolymer-length 0 --polymorphism-bias-cutoff 0 --polymorphism-minimum-coverage-each-strand 0 -o REL8595M-no-filters -r REL606.gbk SRR030257_1.fastq SRR030257_2.fastq
This command is expected to take roughly 30 minutes to an hour to complete.
4. Compare predictions of mutations¶
Open no-filters/output/index.html. See if you can find examples of
mutations that are probably due to different types of sequencing biases
by delving into the original _breseq HTML files that show the read
alignments (RA).
You might first want to create a comparison table of the results from the two _breseq runs.
$ cp REL8595M-default/output/output.gd default.gd
$ cp REL8595M-no-filters/output/output.gd no-filters.gd
$ gdtools COMPARE -o compare.html -r REL606.gbk default.gd no-filters.gd
Can you find any predictions that look like plausible mutations that were incorrectly rejected by the default filters?
5. Examine allele frequency time courses¶
Since it would take a long time to create results for all of the mixed population samples, download these output files pre-generated with _breseq using the default polymorphism filtering options in order to continue the tutorial:
If you look at these files, you will also notice that metadata (experiment, population, generation) has been added to these files that enables them to be properly sorted into order.
Make a compare table for all of these files.
$ cd population_gd
$ gdtools COMPARE -r ../REL606.gbk -o ../time-course.html `ls *.gd`
Open the HTML output file and look at the trajectories of mutations that appear early and later.
Here are a few questions to get you started thinking about the data:
- What do you notice about the last sample, REL11151 (45,000 generations)? It's located furthest to the right.
- What other potential problems do you notice with the output?