Prioritizing Variants using Rough-set based Relevance Algorithm for GWAS
Date Issued
2022-01-01
Author(s)
Sharma, Jyoti
Hafeez, Khadija Sana
Paul, Sushmita
DOI
10.1109/CIBCB55180.2022.9863046
Abstract
Genome-wide Association Studies (GWA studies) are performed to identify genetic variants, such as Single Nucleotide Polymorphisms (SNPs), that are significantly associated with a phenotype in case-control or cohort study designs. GWA studies rest on the fundamental assumption that the most statistically significant variants have the most decisive influence on the phenotype. Thus, most GWA studies use statistical approaches to identify the variants that fall below a significance threshold. However, conventional statistical techniques that simply apply a threshold fail to identify significant variants for complex traits, since such traits are driven by both genetic and environmental factors. Therefore, it is critical to design approaches that can capture the SNPs that significantly affect complex traits. To address this, several machine learning algorithms are being designed. However, all such techniques face the problem of a low sample-to-feature ratio, which creates redundancy and uncertainty in GWA studies. Therefore, a novel pipeline is designed that uses a feature selection step prior to association testing to identify a crisp set of SNPs that are significantly associated with the trait under consideration. The proposed pipeline combines a rough-set-based relevance technique with a machine learning-based association test, Support Vector Regression, to identify cholesterol-associated SNPs. The pipeline reduces the SNPs to the most relevant ones and decreases the time required for association testing. A comparative performance analysis of the proposed approach against existing approaches is illustrated on the pennCATH cohort dataset through R² statistics and biological analyses. The proposed pipeline outperforms the other methods.
SNP and gene enrichment studies on the SNPs obtained from the proposed pipeline reveal various genes, pathways, and biological processes significantly related to cholesterol, indicating that the proposed rough-set-based feature selection method performs significantly better.
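The abstract does not spell out the relevance measure, so the following is only a minimal sketch of how a rough-set-based feature selection step could rank SNPs before association testing. It assumes the classical rough-set dependency degree (the fraction of samples in the positive region) as a stand-in relevance score, genotypes coded as minor-allele counts 0/1/2, and a phenotype already discretized into classes. The SNP names, the toy data, and the helper names `dependency_degree` and `rank_snps` are hypothetical and not taken from the paper. The Support Vector Regression association test is not shown.

```python
from collections import defaultdict


def dependency_degree(feature, decision):
    """Rough-set dependency of `decision` on one discrete feature.

    Samples sharing a feature value form an equivalence class. A class lies
    in the positive region when all of its samples carry the same decision
    label. The score is the share of samples inside the positive region.
    """
    if len(feature) != len(decision) or not feature:
        raise ValueError("feature and decision must be non-empty and equal in length")
    labels_per_value = defaultdict(set)
    count_per_value = defaultdict(int)
    for value, label in zip(feature, decision):
        labels_per_value[value].add(label)
        count_per_value[value] += 1
    positive = sum(
        count_per_value[value]
        for value, labels in labels_per_value.items()
        if len(labels) == 1
    )
    return positive / len(feature)


def rank_snps(genotypes, decision, top_k):
    """Return the `top_k` SNP names with the highest dependency degree.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    scores = {
        snp: dependency_degree(calls, decision)
        for snp, calls in genotypes.items()
    }
    return sorted(scores, key=lambda snp: (-scores[snp], snp))[:top_k]


# Hypothetical toy data: six samples, three SNPs, binarized cholesterol level.
DECISION = ["low", "low", "low", "high", "high", "high"]
GENOTYPES = {
    "snp_a": [0, 0, 0, 2, 2, 2],  # separates the two classes completely
    "snp_b": [0, 0, 1, 1, 2, 2],  # heterozygous samples are ambiguous
    "snp_c": [1, 1, 1, 1, 1, 1],  # carries no information
}

if __name__ == "__main__":
    for name, calls in GENOTYPES.items():
        print(name, round(dependency_degree(calls, DECISION), 3))
    print(rank_snps(GENOTYPES, DECISION, top_k=2))
```

In a pipeline of the kind described, only the retained SNPs would be passed on to the association test, which is how such a filter can reduce the time spent on association testing.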