Abstract
The exponential growth of large-scale datasets in biomedical research, driven by technological advancements, presents significant challenges in data analysis. This thesis explores the application of machine learning (ML) and advanced computational techniques to high-dimensional biomedical data, demonstrating their potential in various domains. The research focuses on four main areas: (1) using ML to combat the COVID-19 pandemic through early detection of SARS-CoV-2 infection and COVID-19 disease outcome prediction, (2) monitoring the spread of SARS-CoV-2 variants at major events based on wastewater sequencing data, (3) developing a robust feature selection pipeline for bulk RNA sequencing (RNA-seq) datasets, and (4) integrating multi-modal data for biomarker discovery in atopic dermatitis (AD).
The first set of studies highlighted the effectiveness of ML models, particularly gradient boosting, both in the early detection of SARS-CoV-2 infection and in predicting severe outcomes in COVID-19 patients. These models identified critical predictive features and provided practical decision-support tools for clinical use. The second study addressed the management of the COVID-19 pandemic through wastewater-based epidemiology and demonstrated an effective strategy for unbiased estimation of viral spread in communities.
In the third area, the thesis introduces GeneSelectR, an R package designed for feature selection in large-scale RNA-seq datasets. GeneSelectR combines several ML feature selection algorithms, optionally alongside the results of a traditional differential gene expression analysis, with an assessment of the biological relevance of the selected feature lists, offering a comprehensive framework. Its application to large-scale cancer and AD RNA-seq datasets revealed important transcript subsets, showcasing its utility in the analysis of complex datasets.
The fourth focus area demonstrated the power of ML in integrating multi-modal data to discover biomarkers for AD. By combining RNA-seq, clinical questionnaire, and cytokine profiling data, the study identified robust biomarkers, providing deeper insights into protective and susceptibility features associated with AD.
In conclusion, this thesis demonstrates the potential of ML techniques, offering robust solutions for complex data analysis in the biomedical field.