Abstract
In an era of biology where modern imaging and sequencing technologies make it possible to study almost any biological process at the molecular level, the content, activity and identity of single cells from almost any organism can now be characterized. High-throughput sequencing technologies are now widely used in laboratories around the globe and can routinely sequence the genetic content of thousands or even millions of cells. They are used not only to study basic cellular properties but also to understand the mechanisms underlying disease states and the cellular responses to new treatments. This field of science is highly dynamic and, because so much effort is invested in (computational) method development, the scientific community needs a barometer to measure its current state and needs. Guidelines and recommendations for analyzing these new types of data are critical, and providing them is one of the roles of benchmarks. However, the current strategy for performing benchmarks still often relies on building a new evaluation framework from scratch, a process that is time consuming, poorly reproducible and prone to partiality. In this thesis, I have tackled these challenges from different angles. First, from an analyst's point of view, I participated in the analysis of single-cell data from an immunotherapy experiment and developed a semi-automatic analysis toolkit to facilitate single-cell data analysis for other researchers. Second, from a benchmarker's perspective, I evaluated analysis tools for single-cell data, with a focus on how to best retrieve cell populations using clustering algorithms. Finally, I participated in the development of a new computational platform, Omnibenchmark, hosting collaborative and continuous benchmarks with the aim of providing up-to-date method recommendations to the community.
Single-cell RNA sequencing is one of the most popular and most widely used high-throughput technologies. It yields the transcriptional profile (genome-wide or targeted) of single cells, and these data hold potentially key information for basic and translational research. For example, it is now possible to identify the response of a set of genes (differential expression) or cells (differential abundance) to a given treatment or condition, retrieve the gene pathways involved in a cellular response (pathway analysis), or find molecular markers that characterize a population of cells (marker gene identification). Ultimately, most of these biological findings rely on a critical step called clustering: an unsupervised machine learning approach that groups data points (here, cells) with similar properties (here, their transcriptomic profiles). In the single-cell context, the aim of clustering is to group similar entities together, provide a proxy for the cell-type heterogeneity in the data, and feed this information into downstream analyses. However, clustering can be influenced by technical effects, and failing to remove them correctly will bias the classification (by grouping cells by batch, sample or other technical effects instead of relevant biological variation). So-called ‘preprocessing’ steps aim to tackle this issue by removing data variation originating from technical effects.
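To make this concrete, the following is a minimal sketch of such a preprocessing-then-clustering workflow using the Scanpy Python library; the input file name is hypothetical, and real analyses tune each step (normalization strategy, number of variable genes, neighborhood size, clustering resolution) to the dataset at hand.

```python
import scanpy as sc

adata = sc.read_h5ad("pbmc.h5ad")  # hypothetical input: a cells-by-genes AnnData object

# Preprocessing: remove depth-driven technical variation, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Reduce noise and dimensionality before building the cell-cell graph
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)

# Graph-based clustering: group cells with similar transcriptomic profiles
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata, key_added="cluster")
```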
However, the single-cell research field is saturated with methods performing the same tasks, each reporting performance scores higher than those of its direct competitors. In Manuscripts I and II, I contributed to benchmarking efforts to evaluate how processing methods influence the critical step of clustering. In other words, we evaluated how well processing methods remove technical effects from single-cell data so that biological effects can be correctly interpreted. With these benchmarks, we could provide guidance on how to best perform several data analysis tasks in different experimental settings.
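A common way such benchmarks score a clustering result is to compare it against known cell labels with a partition-similarity metric such as the adjusted Rand index (ARI); the toy labels below are invented purely for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: ground-truth cell types and two candidate clusterings
truth   = ["B", "B", "T", "T", "NK", "NK"]
method1 = [0, 0, 1, 1, 2, 2]   # recovers the three populations exactly
method2 = [0, 1, 0, 1, 0, 1]   # a grouping driven by a batch-like effect

print(adjusted_rand_score(truth, method1))  # 1.0: perfect agreement
print(adjusted_rand_score(truth, method2))  # about -0.36: below chance level
```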
Single-cell RNA sequencing technologies were, a few years after their introduction, enhanced by the development of multimodal sequencing technologies. For decades, sequencing-based assays targeted one type of molecule at a time (DNA, RNA, protein, ...), and integrating these data across different experiments was a challenge. Multimodal technologies allow different data modalities to be measured simultaneously while maintaining single-cell resolution. One of these technologies, CITE-seq, allows RNA and cell surface proteins to be studied together. The reasoning behind the development of CITE-seq is that, in addition to RNA, surface proteins are: i) tightly linked to cell-type identity; ii) a closer and more stable proxy for cell activity; and iii) the main target of most developed drugs. In Manuscript III, I contributed to the analysis of metastatic renal cell cancer samples treated with an immune checkpoint inhibition therapy. Using CITE-seq data, we could distinguish several subpopulations of CD8 and CD4 T lymphocytes and found that several of them responded positively to the immunotherapy, information that could later be used to better understand the cellular mechanisms involved in the positive response to the treatment. The analysis of such data can be challenging, especially for experimentalists, who often rely on bioinformaticians to generate an analysis pipeline. In Manuscript IV, I helped to develop a standardized and flexible analysis pipeline for multimodal data that performs classical processing and downstream tasks. Our analysis toolkit provides guidance to analysts new to multimodal analysis, and it also lowers the barrier for experimentalists to analyze their own data.

In the last part of the thesis, I focused on the approaches currently used to benchmark methods in single-cell research and on a novel way to tackle their limitations. In the single-cell field, more than a thousand software tools have been developed to analyze these high-dimensional data, and most of the publications presenting these tools perform benchmarking according to their own rules and their own judgment. ‘Neutral’ benchmarks do exist, however; they do not present new methods but give an impartial evaluation of the current state of the field. In Manuscript V, we performed a meta-analysis of these neutral benchmarks to highlight current practices and the limitations of these studies. We found that, while data and code are generally available, most studies do not use all the available informatics tools developed for reproducibility, such as containerization, workflow systems or provenance tracking. In Manuscript VI, we present a new benchmarking system that aims to tackle these issues. This system, called ‘Omnibenchmark’, is a collaborative platform where method developers can host their tools and analysts can find the latest recommendations. We hope to lay the foundations for new benchmarking practices and ultimately increase the neutrality, extensibility and reusability of benchmarks in this field of science.
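As a minimal illustration of the provenance tracking that Manuscript V found to be missing in most benchmarks, the Python sketch below records, next to a result, a checksum of the input data, the run parameters and the computing environment; the file path and parameters are hypothetical, and dedicated workflow systems automate this bookkeeping at scale.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def provenance_record(input_path: str, params: dict) -> dict:
    """Build a minimal provenance record for one benchmark run."""
    with open(input_path, "rb") as fh:
        checksum = hashlib.sha256(fh.read()).hexdigest()
    return {
        "input": input_path,
        "input_sha256": checksum,      # ties the result to the exact input data
        "params": params,              # how the method was configured
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical run: store this JSON alongside the clustering output
record = provenance_record("data/pbmc.h5ad", {"method": "leiden", "resolution": 1.0})
print(json.dumps(record, indent=2))
```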