Episodes
In this episode, I continue to talk (but mostly listen) to Sergey Koren and Sergey Nurk.
If you missed the previous episode, you should probably start there.
Otherwise, join us to learn about HiFi reads, the tradeoff between read length
and quality, and what tricks HiCanu employs to resolve highly similar repeats.
Links:
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads (Sergey Nurk, Brian P....
Published 05/27/20
In this episode Sergey Nurk and Sergey Koren from the NIH share their thoughts
on genome assembly. The two Sergeys tell the stories behind their amazing
careers as well as behind some of the best-known genome assemblers: the Celera
assembler, Canu, and SPAdes.
Links:
Canu on GitHub
SPAdes on GitHub
Published 05/20/20
Porcupine is a molecular tagging system—a way to tag physical
objects with pieces of DNA called molecular bits, or molbits for short.
These DNA tags then can be rapidly sequenced on an Oxford Nanopore MinION
device without any need for library preparation.
In this episode Katie Doroschak explains how Porcupine works—how molbits
are designed and prepared, and how they are directly recognized by the
software without an intermediate basecalling step.
Links:
...
Published 04/29/20
Will Townes proposes a new, simpler way to analyze scRNA-seq data with unique
molecular identifiers (UMIs). Observing that such data is not zero-inflated,
Will has designed a PCA-like procedure inspired by generalized linear models
(GLMs) that, unlike the standard PCA, takes into account statistical
properties of the data and avoids spurious correlations (such as one or more
of the top principal components being correlated with the number of non-zero
gene counts).
Also check out Will’s...
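As a rough illustration of the idea (a sketch, not Will's actual implementation), one can replace the usual log-normalization with residuals from a null Poisson model, which respects the count nature of UMI data, and then run ordinary PCA on those residuals:

```python
import numpy as np

def poisson_deviance_residuals(counts):
    """Residuals from a null Poisson model in which each cell's counts are
    proportional to its total count times the gene's overall abundance."""
    n = counts.sum(axis=1, keepdims=True)       # total counts per cell
    pi = counts.sum(axis=0) / counts.sum()      # relative abundance per gene
    mu = n * pi                                 # expected counts under the null
    # signed square root of each observation's deviance contribution
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(counts > 0, counts * np.log(counts / mu), 0.0)
    d = 2 * (term - (counts - mu))
    return np.sign(counts - mu) * np.sqrt(np.maximum(d, 0.0))

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(50, 100))   # toy cells-by-genes UMI matrix
resid = poisson_deviance_residuals(counts)

# PCA on the residuals instead of on log-normalized counts
resid -= resid.mean(axis=0)
u, s, vt = np.linalg.svd(resid, full_matrices=False)
pcs = u[:, :2] * s[:2]
```

Because the null model already accounts for each cell's total count, the top components are less likely to simply track the number of non-zero genes per cell.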
Published 03/27/20
Help shape the future of the podcast—please take the listener survey!
In this episode we hear from Amatur Rahman
and Karel Břinda, who
independently of one another released preprints on the same concept, called
simplitigs or spectrum-preserving string sets. Simplitigs offer a way to
efficiently store and query large sets of k-mers—or, equivalently, large de
Bruijn graphs.
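To give a flavor of the concept (a toy sketch, not the preprint authors' code), a greedy construction walks maximal non-branching-style paths through the k-mer set, emitting strings that together contain every k-mer exactly once:

```python
def simplitigs(kmers, k):
    """Greedily assemble a set of k-mers into a small number of strings
    (simplitigs) that together contain every k-mer exactly once."""
    unused = set(kmers)
    out = []
    while unused:
        s = unused.pop()
        # extend to the right while an unused successor k-mer exists
        while True:
            for b in "ACGT":
                nxt = s[-(k - 1):] + b
                if nxt in unused:
                    unused.discard(nxt)
                    s += b
                    break
            else:
                break
        # extend to the left symmetrically
        while True:
            for b in "ACGT":
                prv = b + s[:k - 1]
                if prv in unused:
                    unused.discard(prv)
                    s = b + s
                    break
            else:
                break
        out.append(s)
    return out

kmers = {"ACG", "CGT", "GTT", "TTA"}
reprs = simplitigs(kmers, 3)
```

Each k-mer is stored once, but consecutive k-mers share k−1 characters, so the total text is much smaller than listing the k-mers separately.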
Links:
Simplitigs as an efficient and scalable representation of de Bruijn...
Published 02/28/20
Kris Parag is here to teach us about the mathematical modeling of
infectious disease epidemics. We discuss the SIR model, renewal models, and how
insights from information theory can help us predict where an epidemic is
going.
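As a toy sketch of the renewal-model idea (not Kris's actual estimator), new cases at time t are Poisson with mean R times the "total infectiousness" of past cases, and R can be estimated over a sliding window:

```python
import numpy as np

def total_infectiousness(incidence, w):
    """Lambda_t = sum_s w_s * I_{t-s}: past cases weighted by the
    generation-interval distribution w."""
    lam = np.zeros(len(incidence))
    for t in range(1, len(incidence)):
        s = np.arange(1, min(t, len(w)) + 1)
        lam[t] = np.dot(w[s - 1], incidence[t - s])
    return lam

def reproduction_number(incidence, w, window=7):
    """Windowed MLE of R under the Poisson renewal model
    I_t ~ Poisson(R * Lambda_t), assuming R is constant in the window."""
    lam = total_infectiousness(incidence, w)
    R = np.full(len(incidence), np.nan)
    for t in range(window, len(incidence)):
        num = incidence[t - window + 1 : t + 1].sum()
        den = lam[t - window + 1 : t + 1].sum()
        R[t] = num / den if den > 0 else np.nan
    return R

# toy data: roughly exponential growth, short generation interval
w = np.array([0.2, 0.5, 0.3])
incidence = np.array([1, 2, 3, 5, 8, 12, 19, 30, 46, 72], dtype=float)
R = reproduction_number(incidence, w, window=4)
```

The window length is exactly the kind of tuning choice where, as discussed in the episode, information-theoretic criteria can help.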
Links:
Optimising Renewal Models for Real-Time Epidemic Prediction and Estimation (KV Parag, CA Donnelly)
Adaptive Estimation for Epidemic Renewal and...
Published 01/27/20
Does a given bacterial gene live on a plasmid or the chromosome? What
other genes live on the same plasmid?
In this episode, we hear from Sergio Arredondo-Alonso and Anita Schürch, whose
projects mlplasmids and gplas answer these types of questions.
Links:
mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species (Sergio Arredondo-Alonso, Malbert R. C. Rogers, Johanna C. Braat, Tess D. Verschuuren, Janetta Top,...
Published 12/30/19
In this episode Benjamin Callahan talks about some of the issues faced by
microbiologists when conducting metagenomic studies. The two main themes are:
- Why one should probably avoid using OTUs (operational taxonomic units) in
favor of exact sequence variants (also called amplicon sequence variants, or
ASVs), and how DADA2 manages to deduce the exact sequences present in the
sample.
- Why abundances inferred from metagenomic data are biased, and how we can
model and correct this...
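A sketch of one simple way to think about such bias (an illustration of a multiplicative model, not Ben's actual code): each taxon's reads are inflated or deflated by a protocol-specific efficiency, so the measured proportions are a distorted, renormalized version of the truth, and the distortion can be inverted if the efficiencies are known:

```python
import numpy as np

def observed_proportions(true_props, efficiency):
    """Multiplicative bias: each taxon's signal is scaled by its
    protocol-specific efficiency, then proportions are renormalized."""
    x = np.asarray(true_props) * np.asarray(efficiency)
    return x / x.sum()

def corrected_proportions(observed, efficiency):
    """Invert the bias given (estimated) per-taxon efficiencies."""
    x = np.asarray(observed) / np.asarray(efficiency)
    return x / x.sum()

true_props = np.array([0.5, 0.3, 0.2])
efficiency = np.array([1.0, 4.0, 0.5])   # e.g. differing lysis/PCR efficiency
obs = observed_proportions(true_props, efficiency)
rec = corrected_proportions(obs, efficiency)
```

Note that under this model the bias in measured proportions depends on which other taxa are present, which is why it cannot be fixed by a single per-sample scaling factor.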
Published 11/29/19
In this episode Luke Anderson-Trocmé
talks about his findings from the 1000 Genomes Project. Namely, the early
sequenced genomes sometimes contain specific mutational signatures that are not
replicated in data from other sources and can be detected via their
association with lower base quality scores. Listen to Luke tell the story
of how he stumbled upon and investigated these fake variants and what their
impact is.
Links:
Legacy Data Confounds Genomics Studies...
Published 10/22/19
In this episode I talk with Irineo Cabreros about causality. We discuss why
causality matters, what does and does not imply causality, and two
different mathematical formalizations of causality: potential outcomes and
directed acyclic graphs (DAGs). Causal models are
usually considered external to and separate from statistical models, whereas
Irineo’s new paper shows how causality can be viewed as a relationship between
particularly chosen random variables (potential outcomes).
...
Published 09/27/19
In this episode we hear from Romain Lopez and Gabriel Misrachi about
scVI—Single-cell Variational Inference.
scVI is a probabilistic model for single-cell gene expression data that
combines a hierarchical Bayesian model with deep neural networks encoding the
conditional distributions. scVI scales to over one million cells and can be
used for scRNA-seq normalization and batch effect removal, dimensionality
reduction, visualization, and differential expression. We also
discuss the recently...
Published 08/30/19
Even though double-stranded DNA has the famous regular helical shape,
there are small variations in the geometry of the helix depending on which
exact nucleotides it is made of at a given position.
In this episode of the bioinformatics chat, Hassan Samee talks about the
role the DNA shape plays in recognition of the DNA by DNA-binding proteins,
such as transcription factors. Hassan also explains how his algorithm, ShapeMF,
can deduce the DNA shape motifs from the ChIP-seq data.
...
Published 07/26/19
An αβ T-cell receptor is composed of two highly variable protein chains, the α
chain and the β chain. However, based on bulk DNA or RNA sequencing alone, it is
impossible to determine which α chain and β chain sequences were paired
in the same receptor.
In this episode Kristina Grigaityte talks about her analysis of 200,000
paired αβ sequences, which have been obtained by targeted single-cell RNA sequencing.
Kristina used the power law distribution to model the T-cell clone...
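As a toy sketch of fitting a power law to clone sizes (not Kristina's actual analysis pipeline), one can maximize the likelihood of a discrete power law p(k) = k^(−α)/ζ(α) over a grid of exponents:

```python
import numpy as np

def fit_discrete_power_law(sizes, alphas=np.arange(1.5, 4.0, 0.01)):
    """Grid-search MLE for the exponent of a discrete power law
    p(k) = k^(-alpha) / zeta(alpha), k = 1, 2, ..."""
    x = np.asarray(sizes, dtype=float)
    ks = np.arange(1, 100_000)
    slogx = np.log(x).sum()
    best_alpha, best_ll = None, -np.inf
    for a in alphas:
        z = np.power(ks, -a).sum()               # truncated zeta(alpha)
        ll = -len(x) * np.log(z) - a * slogx     # power-law log-likelihood
        if ll > best_ll:
            best_alpha, best_ll = a, ll
    return best_alpha

rng = np.random.default_rng(1)
clone_sizes = rng.zipf(a=2.5, size=5000)   # synthetic heavy-tailed clone sizes
alpha_hat = fit_discrete_power_law(clone_sizes)
```

A heavy-tailed fit like this captures the observation that a few T-cell clones are enormous while most are rare.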
Published 06/29/19
Modern genome assembly projects are often based on long reads in an attempt to
bridge longer repeats. However, due to the higher error rate of current
long-read sequencers, assemblers based on de Bruijn graphs do not work well in
this setting, and the approaches that do work are slower.
In this episode Mikhail Kolmogorov from
Pavel Pevzner’s lab joins us to talk about some of the ideas developed in the
lab that made it possible to build a de Bruijn-like assembly graph from noisy
reads....
Published 05/31/19
In this episode we hear from Jacob Schreiber about his algorithm,
Avocado.
Avocado uses deep tensor factorization to break a three-dimensional tensor of
epigenomic data into three orthogonal dimensions corresponding to cell types,
assay types, and genomic loci. Avocado can extract a low-dimensional,
information-rich latent representation from the wealth of experimental data
from projects like the Roadmap Epigenomics Consortium and ENCODE. This
representation allows you to impute genome-wide...
Published 04/29/19
The third Bioinformatics Contest took place in
February 2019.
Alexey Sergushichev, one of the organizers of the contest,
and Gennady Korotkevich, the 1st prize winner,
join me to discuss this year’s problems.
Published 03/24/19
Hi-C is a sequencing-based assay that provides information about the 3-dimensional organization of the genome.
In this episode Simeon Carstens explains how he
applied the Inferential Structure Determination (ISD) framework to build a 3D
model of chromatin and fit that model to Hi-C data using Hamiltonian Monte
Carlo and Gibbs sampling.
Published 02/27/19
Long read sequencing technologies, such as Oxford Nanopore and PacBio,
produce reads from thousands up to a million base pairs in length,
at the cost of an increased error rate. Trevor Pesout
describes how he and his colleagues leverage long reads for simultaneous
variant calling/genotyping and phasing. This is possible thanks to a clever
use of a hidden Markov model, and two different algorithms based on this model
are now implemented in
the MarginPhase and WhatsHap tools.
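To make the HMM idea concrete (a heavily simplified sketch, not the MarginPhase/WhatsHap model), imagine two hidden states, "this stretch of the read follows haplotype 0" versus "haplotype 1", with noisy allele observations at heterozygous sites; Viterbi then recovers the most likely haplotype path:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit_seq):
    """Most likely hidden-state path in an HMM (log-space Viterbi)."""
    T, S = len(log_emit_seq), len(log_init)
    dp = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_init + log_emit_seq[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans    # prev state x next state
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emit_seq[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

e = 0.05                                           # sequencing error rate
log_emit = np.log(np.array([[1 - e, e],            # state 0 emits allele 0
                            [e, 1 - e]]))          # state 1 emits allele 1
log_trans = np.log(np.array([[0.9, 0.1],           # small switch probability
                             [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))

observed = [0, 0, 1, 0, 0, 1, 1, 1]                # alleles seen along a read
path = viterbi(log_init, log_trans, log_emit[:, observed].T)
```

The lone mismatching allele in the middle is absorbed as a sequencing error, while the sustained change near the end is interpreted as a genuine haplotype switch.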
Published 01/27/19
This time you’ll hear from Fabio Cunial on the topic of Markov models and
space-efficient data structures. First we recall what a Markov model is and
why variable-order Markov models are an improvement over the standard,
fixed-order models. Next we discuss the various data structures and indexes
that allowed Fabio and his collaborators to represent these models in a very
small space while still keeping the queries efficient. Burrows-Wheeler
transform, suffix trees and arrays, tries and suffix...
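As a small sketch of what "variable-order" means (ignoring the space-efficient indexes that are the episode's real subject), a predictor can back off from the longest context it has seen to shorter ones:

```python
from collections import defaultdict

class VariableOrderMarkov:
    """Predict the next symbol using the longest previously seen context,
    up to max_order, backing off to shorter contexts and finally to the
    unconditional symbol frequencies."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i in range(len(text)):
            for k in range(self.max_order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][text[i]] += 1

    def prob(self, context, symbol):
        for k in range(min(self.max_order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            if ctx in self.counts:           # longest context available wins
                tot = sum(self.counts[ctx].values())
                return self.counts[ctx][symbol] / tot
        return 0.0

m = VariableOrderMarkov(max_order=2)
m.train("ACGACGT")
```

A fixed-order model must store every context of length k; the variable-order model keeps long contexts only where the data supports them, which is what makes compressed representations attractive.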
Published 12/28/18
In this episode HoJoon Lee and Seung Woo Cho explain how to perform a CRISPR
experiment and how to analyze its results. HoJoon and Seung Woo developed an
algorithm that analyzes sequenced amplicons containing the CRISPR-induced
double-strand break site and figures out what exactly happened there
(e.g. a deletion, insertion, or substitution).
Published 11/29/18
Relief is a statistical method for feature selection. It can be used,
for instance, to find genomic loci that correlate with a trait or genes whose
expression correlates with a condition. Relief can also be made sensitive to
interaction effects (known in genetics as epistasis).
In this episode Trang Lê joins me
to talk about Relief and her version of Relief called STIR (STatistical
Inference Relief). While traditional Relief algorithms could only rank
features and needed a...
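For intuition, here is a sketch of the original Relief update (not STIR itself): each sampled instance rewards features that differ at its nearest neighbor of the other class (the "miss") and penalizes features that differ at its nearest neighbor of the same class (the "hit"):

```python
import numpy as np

def relief(X, y, n_iter=None):
    """Original Relief for two classes: w[f] grows when feature f separates
    an instance from its nearest miss and agrees with its nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    w = np.zeros(p)
    rng = np.random.default_rng(0)
    idx = rng.choice(n, size=n_iter or n, replace=False)
    span = X.max(axis=0) - X.min(axis=0)   # per-feature normalization
    span[span == 0] = 1.0
    for i in idx:
        d = np.abs(X - X[i]).sum(axis=1)   # L1 distance to every instance
        d[i] = np.inf                      # exclude the instance itself
        same = y == y[i]
        hit = np.where(same, d, np.inf).argmin()
        miss = np.where(~same, d, np.inf).argmin()
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / len(idx)

# toy data: feature 0 separates the classes, feature 1 is pure noise
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])
weights = relief(X, y)
```

Because hits and misses are nearest neighbors in the full feature space, a feature can score well even when it matters only in combination with others, which is how Relief picks up interaction effects.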
Published 10/27/18
Kaushik Panda and Keith Slotkin come on the podcast to educate us about
repetitive DNA and transposable elements. We talk LINEs, SINEs, LTRs, and even
Sleeping Beauty transposons! Kaushik and Keith explain why repeats matter for your
whole-genome analysis and answer listeners’ questions.
Published 09/24/18
Antoine Limasset joins me to talk about NGS read correction.
Antoine and his colleagues built Bcool, a read correction tool based on the
de Bruijn graph, which corrects reads far better than current methods
such as Bloocoo, Musket, and Lighter.
We discuss why and when read correction is needed, how Bcool works, and why
it is more accurate, though slower, than k-mer spectrum methods.
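For contrast, here is a sketch of the k-mer spectrum approach that Bcool improves upon (a toy illustration, not any of the tools mentioned): k-mers seen often are trusted as "solid", and a base covered only by weak k-mers is edited until its k-mers become solid:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the read set."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def correct_read(read, counts, k, t=2):
    """Greedy k-mer-spectrum correction: at each position covered only by
    weak k-mers (count < t), try substitutions that make them solid."""
    read = list(read)
    for i in range(len(read)):
        window = range(max(0, i - k + 1), min(i, len(read) - k) + 1)

        def solid(base):
            read[i] = base
            return all(counts[''.join(read[j:j + k])] >= t for j in window)

        orig = read[i]
        if not solid(orig):
            for b in "ACGT":
                if b != orig and solid(b):
                    break
            else:
                read[i] = orig   # no substitution helps; leave it alone
    return ''.join(read)

reads = ["ACGTACGT"] * 5 + ["ACGTACCT"]   # one read carries a likely error
counts = kmer_counts(reads, 4)
fixed = correct_read("ACGTACCT", counts, 4, t=2)
```

The coverage threshold t is the weak point of this family of methods: a genuine rare variant and a sequencing error look the same to a pure count-based filter, which is part of the motivation for graph-based correction.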
Published 08/31/18
In this episode I talk to Fernando Portela,
a software engineer and
amateur scientist
who works on RNA design — the problem of composing an RNA sequence
that has a specific secondary structure.
We talk about how Fernando and others compete and collaborate in designing RNA
molecules in the online game EteRNA and about Fernando’s new
RNA design algorithm, NEMO, which outperforms all prior published methods by a wide margin.
Published 07/27/18
In this episode I’m joined by Chang Xu. Chang is a senior biostatistician
at QIAGEN and an author of smCounter2, a low-frequency somatic variant caller.
To distinguish rare somatic mutations from sequencing errors, smCounter2
relies on unique molecular identifiers, or UMIs, which help identify multiple
reads resulting from the same physical DNA fragment.
Chang explains what UMIs are, why they are useful, and how smCounter2 and other
tools in this space use UMIs to detect low-frequency variants.
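The core UMI trick can be sketched in a few lines (an illustration of the general idea, not smCounter2's actual algorithm): reads sharing a UMI came from the same original DNA fragment, so collapsing them by majority vote cancels independent sequencing errors while preserving a variant carried by the whole fragment:

```python
from collections import Counter, defaultdict

def umi_consensus(reads):
    """Collapse reads sharing a UMI into one consensus sequence by
    per-position majority vote; a base that disagrees within a UMI group
    is likely a sequencing error, not a real variant."""
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)
    out = {}
    for umi, seqs in by_umi.items():
        out[umi] = ''.join(Counter(col).most_common(1)[0][0]
                           for col in zip(*seqs))
    return out

reads = [
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGAACGT"),   # one read of this molecule has an error
    ("TTGC", "ACGTTCGT"),   # a different molecule: its variant survives
    ("TTGC", "ACGTTCGT"),
]
cons = umi_consensus(reads)
```

A low-frequency somatic variant then shows up as a consensus difference supported by entire UMI families, rather than by scattered individual reads.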
Published 06/29/18