The top portion of the campus entrance gate showing IISER Pune logo

Refined transcription factor DNA-binding motif discovery from pangenomic ChIP-seq, ATAC-seq and similar datasets.

By Denis Thieffry, PSL University, Paris, France

Seminar hall 51, 4th floor

Abstract 

The development of high-throughput sequencing (HTS) techniques has opened up new avenues
for identifying, modelling and predicting DNA motifs bound by transcription factors (TFs). On
the one hand, provided that a good antibody is available, chromatin immunoprecipitation
assays coupled with HTS (ChIP-seq) can capture most TF-bound sequences in a given cell type
or tissue at the genomic scale. On the other hand, epigenomic assays, including whole-genome
bisulphite sequencing (WGBS) and combinations of ChIP-seq assays targeting chromatin
marks, can be used to identify potential promoter and enhancer regions.

Using these datasets, various types of computational analyses can be performed to infer which
transcription factors are potentially involved. The most common approach is to analyse putative
cis-regulatory sequences (promoters or enhancers) using collections of probabilistic models of
transcription factor binding sites, typically in the form of position weight matrices (PWMs),
which can be found in public databases such as JASPAR (https://jaspar.elixir.no/). However,
this approach is inherently limited by the quality of the available PWM sets.
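
As an illustration of the PWM-based scanning described above, here is a minimal Python sketch. The count matrix is invented for the example (it is not taken from JASPAR), the background is assumed uniform, and only the forward strand is scanned; the function names are hypothetical helpers, not part of any existing tool.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position).
# The numbers are invented for illustration only.
COUNTS = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 0, "C": 0, "G": 20, "T": 0},
    {"A": 18, "C": 0, "G": 1, "T": 1},
    {"A": 2, "C": 16, "G": 1, "T": 1},
]


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position counts into log2 odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in column.items()
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence (forward strand only)."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[offset][base] for offset, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the (position, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    position, score = best_hit("AAATGACCCG", pwm)
    print(position, round(score, 2))
```

The quality of such a scan depends entirely on the matrix supplied, which is the limitation noted above.
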

Another approach is to apply pattern discovery algorithms to regions presumed to be
co-regulated, then compare the patterns obtained with public collections of PWMs. Pattern
discovery algorithms (e.g., Gibbs samplers, MEME) typically perform multiple local
alignments on a set of sequences, which requires pre-filtering and heuristic sampling to process
large sets (thousands) of sequences, at the risk of missing subtle variations in the patterns.

To overcome the shortcomings of these multiple alignment approaches, Jacques van Helden
initiated the development of a set of tools based on k-mer counting and multinomial statistics
to identify words that are overrepresented in large sequence datasets and to construct refined
PWMs (http://rsat.eu).
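
The idea of detecting overrepresented words by k-mer counting can be sketched as follows. This is a simplified illustration, not the RSAT implementation: it assumes a background model built from single-nucleotide frequencies of the input itself, uses a binomial upper tail as the significance measure, and ignores overlapping occurrences and the reverse strand. All function names are hypothetical.

```python
import math
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer occurrence (overlaps included) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def binomial_upper_tail(observed, trials, prob):
    """P(X >= observed) for X ~ Binomial(trials, prob), computed in log space."""
    if prob <= 0.0:
        return 1.0 if observed == 0 else 0.0
    if prob >= 1.0:
        return 1.0
    tail = 0.0
    for i in range(observed, trials + 1):
        log_term = (
            math.lgamma(trials + 1) - math.lgamma(i + 1) - math.lgamma(trials - i + 1)
            + i * math.log(prob) + (trials - i) * math.log(1.0 - prob)
        )
        tail += math.exp(log_term)
    return min(tail, 1.0)


def overrepresented_kmers(sequences, k):
    """Rank k-mers by binomial p-value under a single-nucleotide background (assumption)."""
    counts = count_kmers(sequences, k)
    total_kmers = sum(counts.values())
    base_counts = Counter("".join(sequences))
    total_bases = sum(base_counts.values())
    base_freq = {base: n / total_bases for base, n in base_counts.items()}

    results = []
    for kmer, observed in counts.items():
        prob = math.prod(base_freq[base] for base in kmer)
        expected = total_kmers * prob
        p_value = binomial_upper_tail(observed, total_kmers, prob)
        results.append((kmer, observed, expected, p_value))
    return sorted(results, key=lambda row: row[3])


if __name__ == "__main__":
    toy = ["ATGACA", "CTGACG", "GTGACT", "TTGACC"]
    for kmer, observed, expected, p_value in overrepresented_kmers(toy, 4)[:3]:
        print(kmer, observed, round(expected, 3), p_value)
```

Because counting is linear in the total sequence length, this kind of approach scales to thousands of sequences without the pre-filtering that alignment-based methods require.
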

More recently, thanks to the accumulation of ChIP-seq data for various transcription factors,
combined with WGBS data, in the same well-established cell lines, it has become possible to
study in greater detail the impact of DNA methylation on transcription factor binding. By
combining ChIP-seq datasets targeting various dimeric transcription factor partners in the same
cell lines, Touati Benoukraf and collaborators were able to define refined PWMs for each
dimer, with higher information content than the degenerate motifs stored in public
databases. These refined motifs are now available in the MethMotif database
(https://methmotif.org), and a set of functions written in the R programming language,
grouped in the TFregulomeR package, is shared on GitHub to ease the analysis of new ChIP-seq
and WGBS datasets (https://github.com/benoukraflab/TFregulomeR).
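
To make the notion of "information content" concrete, the sketch below computes it per column as 2 + sum(f * log2 f) bits, assuming a uniform background. The two example motifs are invented for illustration and are not taken from MethMotif or any other database; the function names are hypothetical.

```python
import math


def column_information(freqs):
    """Information content (bits) of one column, assuming a uniform background."""
    entropy_term = sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return 2.0 + entropy_term


def motif_information(motif):
    """Total information content of a motif given as a list of frequency columns."""
    return sum(column_information(column) for column in motif)


# Invented example: a degenerate motif and a sharper, "refined" one.
DEGENERATE = [
    {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
REFINED = [
    {"A": 0.9, "C": 0.0, "G": 0.1, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]

if __name__ == "__main__":
    print(round(motif_information(DEGENERATE), 3))
    print(round(motif_information(REFINED), 3))
```

A column with a single allowed base reaches the maximum of 2 bits, while a fully degenerate column contributes 0 bits.
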