Episodes
In this episode, I continue to talk (but mostly listen) to Sergey Koren and Sergey Nurk. If you missed the previous episode, you should probably start there. Otherwise, join us to learn about HiFi reads, the tradeoff between read length and quality, and what tricks HiCanu employs to resolve highly similar repeats. Links: HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads (Sergey Nurk, Brian P....
Published 05/27/20
In this episode Sergey Nurk and Sergey Koren from the NIH share their thoughts on genome assembly. The two Sergeys tell the stories behind their amazing careers as well as behind some of the best known genome assemblers: Celera assembler, Canu, and SPAdes. Links: Canu on GitHub SPAdes on GitHub
Published 05/20/20
Porcupine is a molecular tagging system—a way to tag physical objects with pieces of DNA called molecular bits, or molbits for short. These DNA tags can then be rapidly sequenced on an Oxford Nanopore MinION device without any need for library preparation. In this episode Katie Doroschak explains how Porcupine works—how molbits are designed and prepared, and how they are directly recognized by the software without an intermediate basecalling step. Links: ...
Published 04/29/20
Will Townes proposes a new, simpler way to analyze scRNA-seq data with unique molecular identifiers (UMIs). Observing that such data is not zero-inflated, Will has designed a PCA-like procedure inspired by generalized linear models (GLMs) that, unlike the standard PCA, takes into account statistical properties of the data and avoids spurious correlations (such as one or more of the top principal components being correlated with the number of non-zero gene counts). Also check out Will’s...
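To make the idea concrete, here is a small sketch (my own toy example, not Will's implementation) of a count-aware workflow: compute Poisson deviance residuals from a simulated UMI count matrix under a simple null in which each expected count is the cell total times the gene proportion, then run ordinary PCA on the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(200, 500))  # cells x genes, simulated UMI counts

# Null model: expected count = cell total * gene proportion
cell_totals = counts.sum(axis=1, keepdims=True)
gene_props = counts.sum(axis=0) / counts.sum()
mu = cell_totals * gene_props

# Poisson deviance residuals: sign(y - mu) * sqrt(2 * (y*log(y/mu) - (y - mu)))
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(counts > 0, counts * np.log(counts / mu), 0.0)
dev = 2.0 * (term - (counts - mu))
resid = np.sign(counts - mu) * np.sqrt(np.clip(dev, 0.0, None))

# Ordinary PCA on the residuals via SVD
resid -= resid.mean(axis=0)
U, S, Vt = np.linalg.svd(resid, full_matrices=False)
top_pcs = U[:, :10] * S[:10]  # cells projected onto the top 10 components
print(top_pcs.shape)
```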
Published 03/27/20
Help shape the future of the podcast—please take the listener survey! In this episode we hear from Amatur Rahman and Karel Břinda, who independently of one another released preprints on the same concept, called simplitigs or spectrum-preserving string sets. Simplitigs offer a way to efficiently store and query large sets of k-mers—or, equivalently, large de Bruijn graphs. Links: Simplitigs as an efficient and scalable representation of de Bruijn...
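As a rough illustration of the concept (a strand-unaware toy, not the algorithm from either preprint), the sketch below greedily extends k-mers to the left and right until every k-mer from the input set appears in exactly one output string.

```python
def greedy_simplitigs(kmers, k):
    """Greedily merge a k-mer set into simplitigs (each k-mer used exactly once).

    A simplified, strand-unaware sketch; real tools also handle reverse
    complements and vastly larger inputs.
    """
    unused = set(kmers)
    simplitigs = []
    while unused:
        contig = unused.pop()
        # Extend to the right while an unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                cand = contig[-(k - 1):] + base
                if cand in unused:
                    unused.remove(cand)
                    contig += base
                    extended = True
                    break
        # Extend to the left symmetrically.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                cand = base + contig[:k - 1]
                if cand in unused:
                    unused.remove(cand)
                    contig = base + contig
                    extended = True
                    break
        simplitigs.append(contig)
    return simplitigs


k = 4
reads = ["ACGTACGTGACG", "GTACGTGACGTT"]
kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
print(greedy_simplitigs(kmers, k))
```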
Published 02/28/20
Help shape the future of the podcast—please take the listener survey! Kris Parag is here to teach us about the mathematical modeling of infectious disease epidemics. We discuss the SIR model, renewal models, and how insights from information theory can help us predict where an epidemic is going. Links: Optimising Renewal Models for Real-Time Epidemic Prediction and Estimation (KV Parag, CA Donnelly) Adaptive Estimation for Epidemic Renewal and...
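For a flavor of how a renewal model turns incidence data into reproduction-number estimates, here is a naive point estimate of R_t (toy numbers, no smoothing and no uncertainty quantification, unlike the methods discussed in the episode).

```python
import numpy as np

# Toy incidence curve (cases per day) and a discretized generation-interval
# distribution w; both are made up for illustration.
incidence = np.array([1, 2, 3, 5, 8, 13, 20, 28, 35, 40, 42, 40, 35], dtype=float)
w = np.array([0.1, 0.3, 0.3, 0.2, 0.1])  # P(generation interval = 1..5 days)

# Renewal equation: I_t is approximately R_t * sum_s w_s * I_{t-s}.
# A naive point estimate of R_t divides observed incidence by total infectiousness.
for t in range(len(w), len(incidence)):
    total_infectiousness = sum(w[s - 1] * incidence[t - s] for s in range(1, len(w) + 1))
    r_t = incidence[t] / total_infectiousness
    print(f"day {t}: R_t ~ {r_t:.2f}")
```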
Published 01/27/20
Does a given bacterial gene live on a plasmid or the chromosome? What other genes live on the same plasmid? In this episode, we hear from Sergio Arredondo-Alonso and Anita Schürch, whose projects mlplasmids and gplas answer these types of questions. Links: mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species (Sergio Arredondo-Alonso, Malbert R. C. Rogers, Johanna C. Braat, Tess D. Verschuuren, Janetta Top,...
Published 12/30/19
In this episode Benjamin Callahan talks about some of the issues faced by microbiologists when conducting metagenomic studies. The two main themes are: first, why one should probably avoid OTUs (operational taxonomic units) in favor of exact sequence variants (also called amplicon sequence variants, or ASVs), and how DADA2 manages to deduce the exact sequences present in a sample; and second, why abundances inferred from metagenomics data are biased, and how we can model and correct this...
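The bias idea is easy to demonstrate in a few lines: observed relative abundances are, to a first approximation, true relative abundances multiplied by taxon-specific measurement efficiencies and renormalized, so known efficiencies (e.g. calibrated from a mock community) can be divided back out. The numbers below are made up for illustration.

```python
import numpy as np

# Toy example of multiplicative measurement bias in relative abundances.
true_props = np.array([0.5, 0.3, 0.2])
efficiency = np.array([1.0, 4.0, 0.5])   # hypothetical per-taxon extraction/PCR efficiency

# What we observe: true proportions scaled by efficiency, then renormalized.
observed = true_props * efficiency
observed /= observed.sum()

# If the efficiencies are known, the bias can be inverted.
corrected = observed / efficiency
corrected /= corrected.sum()

print("observed :", observed.round(3))
print("corrected:", corrected.round(3))   # recovers true_props
```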
Published 11/29/19
In this episode Luke Anderson-Trocmé talks about his findings from the 1000 Genomes Project. Namely, the genomes sequenced early in the project sometimes contain specific mutational signatures that haven’t been replicated in data from other sources and can be found via their association with lower base quality scores. Listen to Luke tell the story of how he stumbled upon and investigated these fake variants and what their impact is. Links: Legacy Data Confounds Genomics Studies...
Published 10/22/19
In this episode I talk with Irineo Cabreros about causality. We discuss why causality matters, what does and does not imply causality, and two different mathematical formalizations of causality: potential outcomes and directed acyclic graphs (DAGs). Causal models are usually considered external to and separate from statistical models, whereas Irineo’s new paper shows how causality can be viewed as a relationship between particularly chosen random variables (potential outcomes). ...
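A tiny simulation (my own toy example, not from Irineo's paper) shows why association alone is not causation: a confounder drives both the "treatment" and the outcome, producing a large naive difference that disappears once the treatment is randomized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy DAG: confounder U -> treatment T and U -> outcome Y; T has no effect on Y.
u = rng.normal(size=n)
t = (u + rng.normal(size=n) > 0).astype(float)
y = 2.0 * u + rng.normal(size=n)

# Naive comparison: strong association despite zero causal effect.
print("naive difference:", y[t == 1].mean() - y[t == 0].mean())

# Randomizing T (breaking the U -> T arrow) removes the association.
t_rand = rng.integers(0, 2, size=n).astype(float)
print("randomized difference:", y[t_rand == 1].mean() - y[t_rand == 0].mean())
```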
Published 09/27/19
In this episode we hear from Romain Lopez and Gabriel Misrachi about scVI—Single-cell Variational Inference. scVI is a probabilistic model for single-cell gene expression data that combines a hierarchical Bayesian model with deep neural networks encoding the conditional distributions. scVI scales to over one million cells and can be used for scRNA-seq normalization and batch effect removal, dimensionality reduction, visualization, and differential expression. We also discuss the recently...
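For readers who want to try scVI themselves, here is a minimal usage sketch. It assumes the later scvi-tools packaging of the model (which postdates this episode; function names may differ between versions) and uses simulated counts so that it runs end to end.

```python
import numpy as np
import anndata
import scvi  # assumes the scvi-tools package

# Simulated raw UMI counts with a batch label, just so the sketch runs.
rng = np.random.default_rng(0)
adata = anndata.AnnData(rng.poisson(1.0, size=(200, 100)).astype(np.float32))
adata.obs["batch"] = rng.choice(["a", "b"], size=adata.n_obs)

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=10)
model.train(max_epochs=20)

latent = model.get_latent_representation()      # batch-corrected embedding, cells x 10
normalized = model.get_normalized_expression()  # denoised, normalized expression
```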
Published 08/30/19
Even though double-stranded DNA has its famous regular helical shape, there are small variations in the geometry of the helix depending on which exact nucleotides it is made of at each position. In this episode of the bioinformatics chat, Hassan Samee talks about the role DNA shape plays in the recognition of DNA by DNA-binding proteins, such as transcription factors. Hassan also explains how his algorithm, ShapeMF, can deduce DNA shape motifs from ChIP-seq data. ...
Published 07/26/19
An αβ T-cell receptor is composed of two highly variable protein chains, the α chain and the β chain. However, based only on bulk DNA or RNA sequencing it is impossible to determine which of the α chain and β chain sequences were paired in the same receptor. In this episode Kristina Grigaityte talks about her analysis of 200,000 paired αβ sequences, which have been obtained by targeted single-cell RNA sequencing. Kristina used the power law distribution to model the T-cell clone...
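As a small aside on the modeling, here is how a power-law exponent can be estimated from clone sizes by maximum likelihood (a generic continuous power-law toy, not Kristina's exact procedure).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate clone sizes from a continuous power law with exponent alpha via
# inverse-transform sampling: x = x_min * (1 - u)**(-1 / (alpha - 1)).
alpha_true, x_min, n = 2.5, 1.0, 10_000
u = rng.random(n)
clone_sizes = x_min * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

# Maximum-likelihood estimate of the exponent (Hill estimator):
# alpha_hat = 1 + n / sum(log(x_i / x_min))
alpha_hat = 1.0 + n / np.log(clone_sizes / x_min).sum()
print(f"true alpha = {alpha_true}, estimated alpha = {alpha_hat:.3f}")
```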
Published 06/29/19
Modern genome assembly projects are often based on long reads in an attempt to bridge longer repeats. However, due to the higher error rate of the current long read sequencers, assemblers based on de Bruijn graphs do not work well in this setting, and the approaches that do work are slower. In this episode Mikhail Kolmogorov from Pavel Pevzner’s lab joins us to talk about some of the ideas developed in the lab that made it possible to build a de Bruijn-like assembly graph from noisy reads....
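For reference, the textbook de Bruijn graph construction that struggles with noisy reads looks like this: nodes are (k-1)-mers and edges are k-mers, so a single sequencing error creates spurious nodes and edges. The approaches discussed in the episode relax this exact-match requirement.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a plain de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    The textbook construction only; it is shown here to illustrate what an
    assembly graph is, not how noise-tolerant graphs are actually built."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges


reads = ["ACGTACGA", "CGTACGAT"]
for node, successors in de_bruijn_graph(reads, 4).items():
    print(node, "->", ",".join(successors))
```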
Published 05/31/19
In this episode we hear from Jacob Schreiber about his algorithm, Avocado. Avocado uses deep tensor factorization to break a three-dimensional tensor of epigenomic data into three orthogonal dimensions corresponding to cell types, assay types, and genomic loci. Avocado can extract a low-dimensional, information-rich latent representation from the wealth of experimental data from projects like the Roadmap Epigenomics Consortium and ENCODE. This representation allows you to impute genome-wide...
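Avocado itself uses deep tensor factorization, but the underlying idea is easiest to see in the classic linear case: below is a small CP (canonical polyadic) decomposition fitted by alternating least squares on a synthetic cell types by assays by genomic bins tensor. An illustration of tensor factorization only, not Avocado's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic rank-3 tensor: (cell types) x (assay types) x (genomic bins).
I, J, K, R = 8, 5, 50, 3
A_true = rng.normal(size=(I, R))
B_true = rng.normal(size=(J, R))
C_true = rng.normal(size=(K, R))
tensor = np.einsum("ir,jr,kr->ijk", A_true, B_true, C_true)


def khatri_rao(X, Y):
    """Column-wise Kronecker product, shape (rows_X * rows_Y, R)."""
    return np.einsum("ir,jr->ijr", X, Y).reshape(-1, X.shape[1])


# CP decomposition by alternating least squares.
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))
for _ in range(50):
    A = tensor.reshape(I, J * K) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
    B = tensor.transpose(1, 0, 2).reshape(J, I * K) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
    C = tensor.transpose(2, 0, 1).reshape(K, I * J) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))

recon = np.einsum("ir,jr,kr->ijk", A, B, C)
print("relative error:", np.linalg.norm(recon - tensor) / np.linalg.norm(tensor))
```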
Published 04/29/19
The third Bioinformatics Contest took place in February 2019. Alexey Sergushichev, one of the organizers of the contest, and Gennady Korotkevich, the 1st prize winner, join me to discuss this year’s problems.
Published 03/24/19
Hi-C is a sequencing-based assay that provides information about the 3-dimensional organization of the genome. In this episode Simeon Carstens explains how he applied the Inferential Structure Determination (ISD) framework to build a 3D model of chromatin and fit that model to Hi-C data using Hamiltonian Monte Carlo and Gibbs sampling.
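To give a feel for the sampling machinery, here is a bare-bones Hamiltonian Monte Carlo step with a leapfrog integrator, targeting a 2-D Gaussian rather than an actual chromatin posterior (an illustration of the method, not Simeon's ISD code).

```python
import numpy as np

rng = np.random.default_rng(0)


# Target: an unnormalized log-density (a 2-D standard Gaussian here; in ISD it
# would be the posterior over chromatin bead coordinates).
def log_prob(x):
    return -0.5 * np.sum(x ** 2)


def grad_log_prob(x):
    return -x


def hmc_step(x, step_size=0.2, n_leapfrog=20):
    """One Hamiltonian Monte Carlo transition with a leapfrog integrator."""
    p = rng.normal(size=x.shape)                     # sample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(x_new)  # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new                   # full step for position
        p_new += step_size * grad_log_prob(x_new)    # full step for momentum
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(x_new)  # final half step
    # Metropolis acceptance based on the change in total energy.
    current_h = -log_prob(x) + 0.5 * np.sum(p ** 2)
    proposed_h = -log_prob(x_new) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.random()) < current_h - proposed_h:
        return x_new
    return x


x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x)
    samples.append(x)
print("sample covariance:\n", np.cov(np.array(samples).T))
```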
Published 02/27/19
Long-read sequencing technologies, such as Oxford Nanopore and PacBio, produce reads from thousands to a million base pairs in length, at the cost of an increased error rate. Trevor Pesout describes how he and his colleagues leverage long reads for simultaneous variant calling/genotyping and phasing. This is possible thanks to a clever use of a hidden Markov model, and two different algorithms based on this model are now implemented in the MarginPhase and WhatsHap tools.
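The dynamic programming at the heart of HMM-based methods is the classic Viterbi recursion; the sketch below decodes a toy two-state "haplotype" HMM with made-up probabilities (a generic illustration, not the MarginPhase or WhatsHap model).

```python
import numpy as np

# Minimal Viterbi decoder for a two-state HMM; states, probabilities, and
# observations are all invented for illustration.
states = ["hap1", "hap2"]
start = np.log([0.5, 0.5])
trans = np.log([[0.95, 0.05],    # switching haplotypes is rare
                [0.05, 0.95]])
emit = np.log([[0.9, 0.1],       # P(observed allele | state)
               [0.1, 0.9]])
obs = [0, 0, 1, 0, 1, 1, 1]      # observed alleles at successive het sites

n_states, n_obs = len(states), len(obs)
dp = np.full((n_obs, n_states), -np.inf)
back = np.zeros((n_obs, n_states), dtype=int)
dp[0] = start + emit[:, obs[0]]
for t in range(1, n_obs):
    for s in range(n_states):
        scores = dp[t - 1] + trans[:, s]
        back[t, s] = np.argmax(scores)
        dp[t, s] = scores[back[t, s]] + emit[s, obs[t]]

# Backtrack the most likely state path.
path = [int(np.argmax(dp[-1]))]
for t in range(n_obs - 1, 0, -1):
    path.append(back[t, path[-1]])
path.reverse()
print([states[s] for s in path])
```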
Published 01/27/19
This time you’ll hear from Fabio Cunial on the topic of Markov models and space-efficient data structures. First we recall what a Markov model is and why variable-order Markov models are an improvement over the standard, fixed-order models. Next we discuss the various data structures and indexes that allowed Fabio and his collaborators to represent these models in a very small space while still keeping the queries efficient. Burrows-Wheeler transform, suffix trees and arrays, tries and suffix...
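Since the Burrows-Wheeler transform comes up repeatedly in the conversation, here is the naive sorted-rotations construction (fine for short strings; real indexes build the transform via suffix arrays to avoid quadratic memory).

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted cyclic rotations."""
    text += "$"  # unique end-of-string sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


print(bwt("BANANA"))  # ANNB$AA
```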
Published 12/28/18
In this episode HoJoon Lee and Seung Woo Cho explain how to perform a CRISPR experiment and how to analyze its results. HoJoon and Seung Woo developed an algorithm that analyzes sequenced amplicons containing the CRISPR-induced double-strand break site and figures out what exactly happened there (e.g. a deletion, insertion, or substitution).
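A toy version of this classification can be done by trimming the common prefix and suffix of the read and the reference and inspecting what remains around the cut site. Real pipelines use proper alignment and handle mixed outcomes; this is only a sketch of the idea.

```python
def classify_edit(ref, read):
    """Classify the edit in an amplicon read relative to the reference by
    trimming the shared prefix and suffix (toy illustration only)."""
    if read == ref:
        return "unedited"
    # Longest common prefix.
    p = 0
    while p < min(len(ref), len(read)) and ref[p] == read[p]:
        p += 1
    # Longest common suffix that does not overlap the prefix.
    s = 0
    while (s < min(len(ref), len(read)) - p
           and ref[len(ref) - 1 - s] == read[len(read) - 1 - s]):
        s += 1
    ref_mid = ref[p:len(ref) - s]
    read_mid = read[p:len(read) - s]
    if not ref_mid and read_mid:
        return f"insertion of {read_mid}"
    if ref_mid and not read_mid:
        return f"deletion of {ref_mid}"
    return f"substitution {ref_mid} -> {read_mid}"


ref = "ACGTACGTTTACGGCA"
print(classify_edit(ref, "ACGTACGTTACGGCA"))    # 1-bp deletion
print(classify_edit(ref, "ACGTACGTTTAACGGCA"))  # 1-bp insertion
print(classify_edit(ref, "ACGTACGATTACGGCA"))   # substitution
```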
Published 11/29/18
Relief is a statistical method to perform feature selection. It could be used, for instance, to find genomic loci that correlate with a trait or genes whose expression correlates with a condition. Relief can also be made sensitive to interaction effects (known in genetics as epistasis). In this episode Trang Lê joins me to talk about Relief and her version of Relief called STIR (STatistical Inference Relief). While traditional Relief algorithms could only rank features and needed a
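For listeners unfamiliar with Relief, here is a bare-bones version of the classic algorithm: for randomly chosen instances, reward features that differ from the nearest miss and penalize features that differ from the nearest hit. This is a textbook sketch, not STIR.

```python
import numpy as np


def relief(X, y, n_iter=100, rng=None):
    """Basic binary-class Relief feature weighting (textbook version)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    # Scale feature differences to [0, 1].
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                       # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))   # nearest same-class instance
        miss = np.argmin(np.where(~same, dists, np.inf)) # nearest other-class instance
        w -= np.abs(X[i] - X[hit]) / span / n_iter
        w += np.abs(X[i] - X[miss]) / span / n_iter
    return w


# Toy data: feature 0 is informative, feature 1 is noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200), rng.normal(size=200)])
print(relief(X, y).round(3))   # feature 0 should get the larger weight
```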
Published 10/27/18
Kaushik Panda and Keith Slotkin come on the podcast to educate us about repetitive DNA and transposable elements. We talk LINEs, SINEs, LTRs, and even Sleeping Beauty transposons! Kaushik and Keith explain why repeats matter for your whole-genome analysis and answer listeners’ questions.
Published 09/24/18
Antoine Limasset joins me to talk about NGS read correction. Antoine and his colleagues built the read correction tool Bcool based on the de Bruijn graph, and it corrects reads far better than current methods such as Bloocoo, Musket, and Lighter. We discuss why and when read correction is needed, how Bcool works, and why it performs better, albeit more slowly, than k-mer spectrum methods.
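To contrast with graph-based correction, here is a deliberately naive k-mer-spectrum corrector: k-mers seen fewer times than a threshold are treated as erroneous, and a single-base substitution is accepted if it makes every k-mer in the read solid. A toy only, nothing like Bcool or the other tools mentioned.

```python
from collections import Counter
from itertools import product


def kmer_spectrum_correct(reads, k=5, min_count=2):
    """Toy k-mer-spectrum read correction by single-base substitution."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    solid = {kmer for kmer, c in counts.items() if c >= min_count}

    def is_solid(read):
        return all(read[i:i + k] in solid for i in range(len(read) - k + 1))

    corrected = []
    for read in reads:
        if is_solid(read):
            corrected.append(read)
            continue
        fixed = read
        for pos, base in product(range(len(read)), "ACGT"):
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid(candidate):
                fixed = candidate
                break
        corrected.append(fixed)
    return corrected


reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACCTAC"]  # third read has one error
print(kmer_spectrum_correct(reads))
```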
Published 08/31/18
In this episode I talk to Fernando Portela, a software engineer and amateur scientist who works on RNA design, the problem of finding an RNA sequence that folds into a specified secondary structure. We talk about how Fernando and others compete and collaborate in designing RNA molecules in the online game EteRNA and about Fernando’s new RNA design algorithm, NEMO, which outperforms all prior published methods by a wide margin.
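As a small illustration of how target structures are represented in this problem, the function below parses dot-bracket notation and checks whether a candidate sequence can form every required base pair. This checks compatibility only; actual design algorithms such as NEMO score how the sequence folds under an energy model.

```python
def compatible_with_structure(seq, structure,
                              allowed=("AU", "UA", "GC", "CG", "GU", "UG")):
    """Check that a sequence can form all base pairs of a dot-bracket structure."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return all(seq[i] + seq[j] in allowed for i, j in pairs)


structure = "((((....))))"
print(compatible_with_structure("GGGGAAAACCCC", structure))  # True
print(compatible_with_structure("GGGGAAAACCCA", structure))  # False: G-A cannot pair
```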
Published 07/27/18
In this episode I’m joined by Chang Xu. Chang is a senior biostatistician at QIAGEN and an author of smCounter2, a low-frequency somatic variant caller. To distinguish rare somatic mutations from sequencing errors, smCounter2 relies on unique molecular identifiers, or UMIs, which help identify multiple reads resulting from the same physical DNA fragment. Chang explains what UMIs are, why they are useful, and how smCounter2 and other tools in this space use UMIs to detect low-frequency variants.
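A stripped-down version of the UMI idea: group reads by (UMI, mapping position) and collapse each family into a majority-vote consensus, so that errors private to one read drop out while true variants shared by a family survive. A toy sketch, not smCounter2's actual model.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Group reads by (UMI, position) and build a per-base majority consensus.

    Real tools also handle UMI sequencing errors, read pairs, and base
    qualities; this only illustrates the deduplication idea."""
    groups = defaultdict(list)
    for umi, pos, seq in reads:
        groups[(umi, pos)].append(seq)
    consensus = {}
    for key, seqs in groups.items():
        collapsed = "".join(
            Counter(bases).most_common(1)[0][0] for bases in zip(*seqs)
        )
        consensus[key] = (collapsed, len(seqs))   # consensus sequence + family size
    return consensus


# (UMI, position, read sequence); one molecule sequenced three times with an error.
reads = [
    ("AACGT", 100, "ACGTTTGCA"),
    ("AACGT", 100, "ACGTTTGCA"),
    ("AACGT", 100, "ACGATTGCA"),   # sequencing error in one read of the family
    ("GGTCA", 100, "ACGATTGCA"),   # a different molecule carrying a real variant
]
print(umi_consensus(reads))
```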
Published 06/29/18