Skip to main content
  • Methodology Article
  • Open access
  • Published:

Integrative modeling of multi-omics data to identify cancer drivers and infer patient-specific gene activity

Abstract

Background

High throughput technologies have been used to profile genes in multiple different dimensions, such as genetic variation, copy number, gene and protein expression, epigenetics, metabolomics. Computational analyses often treat these different data types as independent, leading to an explosion in the number of features making studies under-powered and more importantly do not provide a comprehensive view of the gene’s state. We sought to infer gene activity by integrating different dimensions using biological knowledge of oncogenes and tumor suppressors.

Results

This paper proposes an integrative model of oncogene and tumor suppressor activity in cells which is used to identify cancer drivers and compute patient-specific gene activity scores. We have developed a Fuzzy Logic Modeling (FLM) framework to incorporate biological knowledge with multi-omics data such as somatic mutation, gene expression and copy number measurements. The advantage of using a fuzzy logic approach is to abstract meaningful biological rules from low-level numerical data. Biological knowledge is often qualitative, thus combining it with quantitative numerical measurements may leverage new biological insights about a gene’s state. We show that the oncogenic and altered tumor suppressing state of a gene can be better characterized by integrating different molecular measurements with biological knowledge than by each data type alone. We validate the gene activity score using data from the Cancer Cell Line Encyclopedia and drug sensitivity data for five compounds: BYL719 (PIK3CA inhibitor), PLX4720 (BRAF inhibitor), AZD6244 (MEK inhibitor), Erlotinib (EGFR inhibitor), and Nutlin-3 (MDM2 inhibitor). The integrative score improves prediction of drug sensitivity for the known drug targets of these compounds compared to each data type alone. The gene activity scores are also used to cluster colorectal cancer cell lines. Two subtypes of CRCs were found and potential cancer drivers and therapeutic targets for each of the subtypes were identified.

Conclusions

We propose a fuzzy logic based approach to infer gene activity in cancer by integrating numerical data with descriptive biological knowledge. We compute general patient-specific gene-level scores useful to determine the oncogenic or tumor suppressor status of cancer gene drivers and to cluster or classify patients.

Background

Cancer is a complex genetic and genomic disease driven by many different molecular mechanisms. Cancer studies have employed high throughput technologies to profile genes in multiple different dimensions. Genetic variation, copy number, gene and protein expression, epigenetics, and metabolomics are the most commonly studied molecular types of data. Computational analyses commonly evaluate multiple data types as a set of independent features [1]. This leads to a multi-fold increase in the number of features. The statistical limitations in high-dimensional data, where the number of samples is considerably smaller than the number of measurements for each sample, is discussed in [2]. Moreover, each data type represents an incomplete snapshot of a biological process and does not provide a comprehensive view of a gene state. In this paper, we propose an integrative approach of multiple data types to model the activity of oncogenes and tumor suppressors and characterize cancer gene drivers.

An oncogene is a gene that has the potential to cause cancer and it is often mutated in tumor cells. A tumor suppressor is a gene that protects a cell from cancer. Oncogenes present gain of function alterations (GoF), while inactivated tumor suppressor genes present loss of function (LoF) alterations. A cancer driver gene contains driver gene mutations, but may also contain passenger gene mutations with no effect in cancer. Driver mutations are identified based on their pattern of mutations across samples [3]. Well studied oncogenes are recurrently mutated at the same amino acid positions, while tumor suppressors are mutated through protein-truncating alterations throughout their length [3].

We will refer, throughout the paper, to a gene’s potential to cause cancer as a gain of function (GoF)/loss of function (LoF) gene activity.

Large studies like the The Cancer Genome Atlas (TCGA) [4], International network of cancer genome projects (ICGC) [5], Cancer Cell Line Encyclopedia (CCLE) [6] have profiled thousands of tumors for many different data types (exome sequencing, copy number, expression, methylations, etc.). Data portals, such as cBioPortal [7], which contain data from these large cancer studies, and which are widely used by the community, display an integrative view of all data types for a given gene. These plots are very powerful as they show a comprehensive view of different mechanisms by which genes can be aberrant in cancer. Our aim in this paper is to develop a computational framework to provide an integrative activity score of a gene.

Other studies have shown that incorporating biological knowledge into model building improves prediction of breast cancer survival [8] and glioblastoma subtypes [9]. Using probabilistic inference, the method proposed in [9] predicts the degree to which the activity of a pathway is altered in a patient. A gene is modeled as a set of interconnected variables which encode for expression, copy number and protein levels.

Other studies have shown that integration of multiple molecular data types may better characterize the disease. TCGA studies [10] and [11] evaluate multiple data types independently to characterize lung cancer. An integrative approach for predicting the tumor suppressor functional status of a gene is presented in [12]. Using CCLE dataset, the authors show that bi-allelic inactivation of tumor suppressors may occur through genetic mechanisms (loss of function mutation, copy number loss, or loss of heterozygosity) or epigenetic mechanisms (promoter methylation or histone modification) or a combination of the two.

Integrative approaches through network based analysis have been previously developed to predict driver genes. An example is OncoIMPACT framework which nominates patient-specific driver genes based on their phenotypic impact [13]. This approach uses gene interaction networks to associate mutations with changes in cell state, such as transcriptome, proteome, epigenome or metabolome. Another example is the analysis pipeline proposed in [14] which integrates genomic and transcriptomic alterations from whole-exome and RNA sequence data and functional data from protein function prediction and gene interaction networks. This method predicts functional implications of mutated potential driver genes found within and across patients with breast cancer.

In this paper we present a novel approach based on Fuzzy Logic Modeling (FLM) to infer patient-specific GoF/LoF gene activity by integrating multiple molecular data types in a single gene-level score. We use matched gene expression, copy number and mutation data from CCLE and integrate them using biological knowledge about oncogenes and tumor suppressors. Other existing approaches identify cancer drivers by assessing only one data type such as mutation frequency [3], or by correlating mutations with other data types or phenotypes. However, the two-dimensional correlation models the relationship of two variables across patients, while the proposed methodology allows integrating any number of data types at the patient level. Moreover, the FLM score is general and independent of a particular group of patients or phenotype in the dataset. Other methods [9, 15] use probabilistic inference to integrate different types of molecular data with pathway-level information in a patient-specific activity score. These methods depend on prior information about the curated pathway and the gene interactions, assuming a local pathway context for a given gene. The method in [15] models the interaction of a mutated gene with the abundance levels of the upstream and downstream genes, while our method captures the global change of a gene based on its own mutation status and abundance level. Differently from the existing approaches, we use descriptive and intuitive knowledge about cancer drivers to combine multiple data types at the gene-level in a unified patient-specific score. The FLM scores are computed for every gene, therefore these could further be integrated at a pathway-level using graphical models, similarly to [9, 15]. The proposed scores can be used to (i) determine the oncogenic or tumor suppressor status of a gene by assessing the sign of the score; (i i) determine the activity level of an oncogene or tumor suppressor by assessing the magnitude of the score; (i i i) classify samples to predict a specific phenotype; (i v) cluster samples to identify subtype specific gene drivers; (v) reduce the feature space by 3-fold to increase the statistical power of sample stratification. Our approach is general, and can be extended by adding other data types and descriptive rules about known biological processes. We validate the proposed methodology by computing gene activity scores to predict drug sensitivity. In addition we classify genes into oncogenes and tumor suppressors and validate the status of known cancer genes. We also use the activity scores to cluster colorectal cancer (CRC) cell lines.

We show that gene activity can be better described by integrating different molecular measurements than by analyzing each data type alone. To the authors’ knowledge, this is the first study in CCLE data showing that the GoF activity of a gene is characterized by a combination of mutation status, expression level and copy number changes. Moreover, by integrating three measurements in a single score, we are able to reduce the feature space by 3-fold, and therefore increase the statistical power of sample stratification. In addition, the proposed gene activity score can be used to highlight potential cancer gene drivers to improve therapeutic strategies.

Method

Fuzzy logic modeling

Fuzzy logic is an artificial intelligence paradigm inspired by how people intuitively measure variables such as temperature, noise, taste. For example, people generally tend to think of temperature in terms of cold/warm/hot categories instead of in degrees. The categories are loosely defined (fuzzy) and subjective, but still meaningful. Among few applications of fuzzy logic are image stabilization, the control of robots and automations, medical decision support systems, etc. Previous studies have shown that fuzzy logic can be used to build robust gene classifiers [16], to eliminate redundant information from microarray data [17] and to detect gene regulatory networks from microarray data [18]. General concepts and applications about fuzzy logic in Bioinformatics can also be found in [19].

In this paper we present a novel approach using fuzzy logic to integrate multiple types of molecular data with expert curated biological rules. We integrate mutation, gene expression and copy number data to generate sample-specific gene activity scores (Fig. 1 a). We chose fuzzy logic as a modeling framework due to its flexibility of incorporating biological knowledge described by high-level rules with data measurements.

Fig. 1
figure 1

Inferring gene activity by integrating different data types and biological knowledge. a Example showing how mutation, copy number and expression data are important for inferring the activity of PIK3CA (oncogene), and PTEN (tumor suppressor). b Schematic for Fuzzy Logic Modeling (FLM)

A fuzzy logic system is a rule-based model in which categories are defined within a numeric range. These categories do not have to be disjoint. Each numeric value is assigned a degree of belief to each category. Categories are defined as functions modeling the degree of uncertainty. These functions are called membership functions and estimate the degree to which numeric values belong to each category. The fuzzy system computes the output by evaluating a set of rules defined on the categories. Once the rules are evaluated, the result is then defuzzified (converted back to a numeric value) by averaging across the output of the rules. In this paper we will use the centroid method for defuzzification. Figure 1 b illustrates the categories, membership functions and workflow of the proposed fuzzy logic model.

For example, gene expression data can be divided into categories such as low, medium and high, which enables us to abstract away from the numeric data and define inference rules based on the categories (Fig. 1 b). The framework is flexible and allows adding any number of data types or rules. The inference rules can easily be extended or modified without changing the meaning of the categories and their numeric range.

The output of the proposed system is a per sample continuous score [–1, 1]. Positive values indicate that the gene presents a gain of function (GoF) with a level of activity between 0 (lowest) and 1 (highest), while negative values indicate that the gene presents a loss of function (LoF) with a level of inactivation between 0 (lowest) and –1 (highest) in a given sample.

We tested our approach on publicly available data that was downloaded from the Cancer Cell Line Encyclopedia (CCLE)Cancer Cell Line Encyclopedia (CCLE) [6]. We included in our analysis 675 cell lines and a number of 1547 protein-coding genes with matched somatic mutation, copy number and gene expression measurements. We used somatic mutation data which was generated via hybrid capture exome sequencing by [6]. About 1600 protein-coding genes were sequenced based on their known or potential involvement in tumor biology, as described in [6]. We also used copy number (genome-wide human Affymetrix SNP Array 6.0) and gene expression (Affymetrix Human Genome U133 Plus 2.0 arrays) normalized data from [6]. More details about generating and normalizing the CCLE data can be found in [6].

JFuzzyJFuzzy, an open source Java-based software, was used to implement and test the proposed fuzzy logic system.

Defining rules based on biological knowledge of oncogenes and tumor suppressors

Rules for oncogenic or gain of function (GoF) activity and tumor suppressor-like or loss of function (LoF) activity are defined. We consider the GoF and LoF types of mutations presented in [3]: a recurrent missense or in-frame indel may indicate a GoF, while a nonsense, nonstop, frame-shift indel or splice site mutation may indicate a LoF. We hypothesize that a gene with a recurrent GoF mutation which also has a higher expression and/or copy number will be more active in cancer compared to lower expression or deletions. For example mutations, as well as amplifications and over-expression of EGFR gene are known to be implicated in many epithelial cancers, such as lung cancer, glioblastoma, colorectal cancer, etc. [20, 21]. Similarly, a gene with a LoF mutation and low expression and/or copy number will be a more active cancer driver than a highly expressed LoF mutant (the non-mutated allele may be expressed in enough amount to perform the gene’s function). An example is TP53 gene which can loose its tumor supressor function either by LoF mutations or by deletions or under-expression [12, 22, 23].

For those genes that have no or silent mutations, we consider that loss in expression or copy number may indicate LoF, while a gain in expression or copy number levels may indicate a GoF activity. However expression and copy number do not contribute symmetrically to LoF and GoF scores because low expression or deletions are more likely to produce a deactivated tumor suppressor, compared to the symmetric situation with high expression or amplification for activated oncogenes. We take this into account in our rules.

The following shows the two extreme rules in our FLM system. We have 28 more rules that define the spectrum between these two extremes (Table 1).

  • GoF rule (rule 1): If a variant is activating (missense, in-frame indel) and is recurrent (same position) and the expression is high and the copy number is amplified then the gene is a very activating GoF.

    Table 1 Fuzzy rules
  • LoF rule (rule 27): If a variant is inactivating (nonsense, nonstop, frame-shift indels, splice site) and the expression is low and copy number is deleted then the gene is a very inactivating LoF.

In the case of TP53 variants, missense mutations (generally GoF mutations) drive the gene’s LoF [3, 12, 2325]. We have added special rules for TP53 to take this into account.

The proposed inference rules are shown in Table 1 and are also available in Additional file 1.

Defining fuzzy categories for data types

For gene expression, we defined three fuzzy categories by dividing the expression values into three quantiles and computing mean and standard deviation (std) of each group. The medium level is defined by a Gaussian membership function. The low and high levels are sigmoid functions with the inflection points equal to the quantile’s mean. The slopes of the sigmoid functions were approximated with the slope of a Gaussian shape: \({slope_{\textit {sigmoid}} = {\sqrt {8 \cdot ln(2)} \over std_{\textit {gaussian}}}}\). For copy number, we defined identical shapes of membership functions for every gene (triangle for neutral level and trapezoid for amplified and deleted levels). Eight types of mutations are considered (missense, in-frame insertion, in-frame deletion, nonsense, nonstop, frame-shift insertion, frame-shift deletion, splice site). A variant is recurrent if it is present in more than 1 % of the samples at the same position. Additional file 1 is an example of a FCL (Fuzzy Control Language)FCL (Fuzzy Control Language) file which contains the definitions of membership functions and fuzzy rules.

Computing activity score for each variant

The activity score is computed for each variant by applying the set of defined rules. The activity score is modeled by seven membership functions (Table 1 and Additional file 1), corresponding to the following activity categories: high GoF (Sigmoid membership function), GoF, low GoF, no effect, low LoF, LoF (Gaussian membership functions) and high LoF (Sigmoid membership function). The membership assignments of the rules are then aggregated by union. The integrated membership assignment is then converted back into a numeric value (deffuzified) using the centroid method. A score in [–1,1] interval is generated; 1 means maximum GoF, while –1, maximum LoF.

Computing gene-level activity scores and labeling of genes as GoF/LoF

A gene usually presents multiple variants. FLM scores are computed for each variant. To summarize the scores at the gene-level we compute two different scores for each gene and each sample: GoF score (maximum across all GoF variants), LoF score (minimum across all LoF variants). Finally, the gene is labeled as GoF/LoF using the following algorithm: if the majority of the mutated samples (>50 %) have G o F>|L o F|, then the gene is labeled as GoF and the GoF percentage is selected as the gene-level score; similarly, if the majority of the mutated samples (>50 %) have |L o F|>G o F, then the gene is labeled as LoF and the LoF percentage is selected as the gene-level score. To restrict the classification to genes that are more likely to be drivers in CCLE, we considered to classify those genes that are mutated in more than 1 % of the samples (a total number of 1288 genes).

Results and discussion

Computing the FLM gene activity scores on the Cancer Cell Line Encyclopedia

We used 675 cell lines with matched gene expression (Affymetrix U133 Plus 2.0), copy number (Affymetrix SNP6.0 arrays) and somatic mutation (hybrid capture exome sequencing) data from the Cancer Cell Line Encyclopedia (CCLE) [6].

We propose a set of inference rules based on biological knowledge that integrate information about mutation type, mutation frequency, level of expression and copy number amplifications or deletions (Table 1 and Additional file 1).

The distribution of the GoF and LoF scores across all analyzed genes and samples is presented in (Fig. 2 a). This distribution is trimodal with peaks at high LoF, no activity and high GoF scores.

Fig. 2
figure 2

Gene activity scores and inferred GoF/LoF status using Fuzzy Logic Modeling. a Distribution of GoF and LoF activity scores across all genes and all samples. b For each gene that presents mutations in CCLE (more than 1 % of the samples), two scores are computed (GoF and LoF gene score). GoF gene score is computed as the percentage of mutated samples with G o F>|L O F|. LoF gene score is computed as the percentage of mutated samples with |L o F|>G o F. A gene is classified as GoF (oncogene) if the GoF gene score is >50 % or as LoF (tumor suppressor) if the LoF gene score is >50 %. c Known oncogenes [3] were correctly predicted by our method with an accuracy of 90 % (19/21). d Known oncogenes [3] were correctly predicted by our method with an accuracy of 86 % (18/21). Note that the known oncogenes and tumor suppressors were restricted to those that were found to be mutated in the CCLE at >1 % frequency

A number of 1288 genes are assigned GoF and LoF gene scores based on the GoF and LoF percentages in mutated samples, as described in “Computing gene-level activity scores and labeling of genes as GoF/LoF”. The gene score >50 % selects the final gene status (Fig. 2 b). The GoF and LoF gene scores are available in Additional file 2.

FLM identifies known oncogenes and tumor suppressors

Next, we compared the proposed GoF/LoF classification with oncogene/tumor suppressor status proposed in [3]. The known oncogenes and tumor suppressors were restricted to those that were found to be mutated in CCLE at >1 % frequency. Figures 2 c and d show that our approach correctly classified well known oncogenes and tumor suppressors that were also correctly classified by the method proposed in [3]. The accuracy for GoF clasification was 90 % (19/21), and 86 % (18/21) for LoF, respectively. We conclude that the FLM-based approach of classifying GoF/LoF is consistent with [3]. The method proposed by [3] computes a gene-level score across variants. In addition, FLM approach provides a per sample gene activity score that can further be used for sample stratification.

Gene activity scores improve drug sensitivity predictions

The proposed fuzzy rules and the computed gene activity scores were validated by predicting drug sensitivity of cell lines in the CCLE. CCLE provides drug perturbation data which allows us to assess gene activity of specific targeted compounds. Although several compounds were tested in CCLE [6], the mechanisms of action and gene targets are unknown for most of them. To validate our approach, we considered those compounds which directly target a driver gene, known to mutate in cancer, or a gene that directly interacts with a mutant driver gene. We selected one target-compound from each family by excluding those with multiple targets and those which do not interact with a specific mutant gene, since their effect may not directly reflect the activity of the driver gene used as a predictor of sensitivity. Therefore, we used in our analysis targeted compounds profiled in [6], such as PLX4720 (BRAF inhibitor), AZD6244 (MEK inhibitor), Erlotinib (EGFR inhibitor), and Nutlin-3 (MDM2 inhibitor). In addition, we analyzed the drug sensitivity data for BYL719 (PIK3CA inhibitor) from [26]. We tested the prediction performance of drug sensitivity for these compounds using their known driver gene targets.

Drug sensitivity of cell lines can be measured by different metrics, such as the area under the dose response curve (A c t i v i t y A r e a or ActArea), high-concentration effect level (A max ), the transitional concentration (E C 50), the concentration at which the drug response reached an absolute inhibition of 50 % (I C 50) or a combination of these metrics [6, 26]. Although these metrics are theoretically equivalent and correlated with the response of the cells to the drug, it has been shown that some of them work better for certain compounds [6]. Moreover, the thresholds which separate the three classes of cell lines, sensitives, intermediates and resistants, are estimated from the response curve and usually differ from one compound to another. Sensitivity calls have already been validated for BYL719 and Nutlin-3 by other published studies. BYL719 sensitive/resistant classes have been defined using a combination of thresholds for A max and E C 50 metrics in [26], while Nutlin-3 sensitive/resistant classes have been defined using thresholds for I C 50 metric in [12]. For the other compounds we estimated the sensitivity calls by applying a Gaussian Mixture Model on ActArea variable, as recommended in [6]. The thresholds that define the two populations of sensitive/resistant cell lines, were computed based on the means (μ 1,μ 2), standard deviations (σ 1,σ 2) and the intersection point of the two Gaussian distributions (X). Resistant cell lines were defined as those with A c t A r e a< min(X,μ 1+σ 1), while the sensitives, as those with A c t A r e a> max(X,μ 2+σ 2).

Table 2 shows the list of compounds, the genes they target, their sensitivity predictor genes, and the thresholds used to define their sensitivity calls. The genes that are able to predict drug sensitivity could be direct targets of a drug or its off-target effects. For example, Nutlin-3 compound inhibits MDM2 gene which is an inhibitor of TP53 [22, 27, 28]. A cell line may be sensitive to Nutlin-3 if TP53 is not mutated (it normally functions as a TSG) and it is expressed in enough amount (TP53 expression is not suppressed by interactions other than MDM2). Figure 4 c shows higher LoF level in the sensitive cell lines.

Table 2 Gene targets and the predictors of sensitivity for the compounds

Evaluating the AUC on FLM activity scores for each sensitivity predictor gene

We computed the Area Under the Receiver Operating Characteristic Curve (AUC) to evaluate the prediction performance of drug sensitivity for each gene predictor and each data type. For each gene, we computed the AUC metric on each data type separately, as well as on the sample-specific FLM scores. We chose AUC as a valid comparison metric of the prediction performance for the different data types because it can be measured on both binary (such as mutation) and continuous (such as gene expression, copy number and FLM activity scores) values. Then, bootstrap test [29] available in pROC R package [30] was applied in order to compare the AUCs. For all the five compounds, the performance of FLM activity scores was improved compared to each of the data types (expression, copy number and mutation) as shown in Figs. 3 and 4. We tested the sensitivity predictor gene of each compound (Table 2). The FLM scores significantly improved prediction compared to all feature types for BYL719 (PIK3CA gene): p<0.05, PLX4720 (BRAF gene): p<0.00002 and Nutlin-3 (TP53 gene): p<0.06. For AZD6244 (BRAF gene) the improvement of FLM score was significant with respect to expression and copy number data (p<0.01), while the increase in performance was not significant compared to mutation data (p=0.22). For Erlotinib (EGFR gene) the improvement of FLM score was significant with respect to mutation and copy number data (p<0.04), while the increase in performance was not significant compared to expression data (p=0.13). In Figs. 3 and 4, the significance level of 0.05 was denoted by *.

Fig. 3
figure 3

FLM gene activity scores improve prediction of BYL719 drug sensitivity compared to using expression, mutation and copy number data separately. a Boxplot for PIK3CA FLM scores vs. BYL719 (PIK3CA inhibitor) sensitivity. BYL719 sensitive group has higher activity scores compared to the resistant group (t-test p <10−4). Even within the PIK3CA missense mutants (colored in red), we see that FLM GoF scores are higher in sensitive compared to resistant group (t-test p <0.0008). b Using PIK3CA FLM GoF scores to predict sensitivity, the AUC significantly improved compared to expression, mutation and copy number data separately, p<0.05. We denote by * the significance level of 0.05. c Heatmap showing the FLM activity scores for PIK3CA, PTEN and the individual data types. All values are scaled between [–1, 1]. Note that our algorithm correctly labeled PIK3CA as a GoF gene, and PTEN as a LoF gene, consistent with their classification in the literature. The color bar on top indicates the sensitivity groups for the samples (green = sensitive, black = resistant). The combined predictor of PIK3CA GoF scores and PTEN LoF scores significantly improves performance compared to combinations of individual data types, p<0.009

Fig. 4
figure 4

FLM gene activity scores differentiate the sensitive vs. resistant groups better than the relevant mutations (colored red) in each compound: a PLX4720, c Nutlin-3, e AZD6244, g Erlotinib. FLM scores improve prediction of drug sensitivity compared to gene expression, somatic mutation and copy number data separately: b PLX4720, p<0.00002, d Nutlin-3, p<0.06, f AZD6244, p<0.22, h Erlotinib using EGFR-KRAS predictor, p<0.01. We denote by * the significance level of 0.05

Comparing the FLM scores with numeric integration of mutation, gene expression and copy number data

Next, we tested whether the FLM scores outperform the numeric integration of the three data types by comparing the performance of the FLM scores with the performance of the three data types used as a set of classification features. Logistic regression is a commonly used technique for classification of samples based on a set of features [3133]. This algorithm aggregates the numerical values of the three data types, regardless of the biological knowledge captured by the fuzzy rules, and learns from the training data to distinguish sensitives from resistants. The logistic regression model (glmnet) was ran within k-fold cross-validation (k=5). Similarly to the cross-validation procedure described in [34], each sample was assigned to a test fold exactly once. Then, we used the test prediction scores of all samples to compute the overall test AUC. We compared the AUC of the glmnet classifier with a single FLM feature to the AUC of the glmnet classifier with three features of different data types by using bootstrap test [29, 30]. The integrative FLM feature performed significantly better compared to the numeric integration of the three features, for AZD6244 (BRAF gene): p=0.02, PLX4720 (BRAF gene): p=0.02 and Erlotinib (EGFR gene): p=0.04. For BYL719 (PIK3CA gene) and Nutlin-3 (TP53 gene) respectively, the difference in AUCs was not statistically significant.

Building multiple genes classifiers on FLM activity scores

A combined predictor for BYL719 involving PIK3CA and PTEN genes was proposed in [26]. Cell lines with PTEN LoF are less likely to respond to the drug. Therefore, to further evaluate the power of FLM activity scores, we test the prediction performance of the two genes on FLM activity scores, mutation, gene expression and copy number data. A logistic regression model (glmnet) was ran within k-fold cross-validation (k=5) to test the AUC of the 2-genes predictors. Similarly to the cross-validation procedure described in [34], each sample was assigned to a test fold exactly once. Then, we used the test prediction scores of all samples to compute the overall test AUC. The AUC of each data type was then compared to the AUC of the FLM scores by using bootstrap test [29, 30]. The PIK3CA-PTEN predictor increased in performance and FLM scores were significantly better than using each of the individual data types, p<0.009.

Next, we performed a similar analysis for Erlotinib, using EGFR and KRAS combined predictors. The prediction was significantly improved by FLM activity scores compared to the other data types, p<0.01 (Fig. 4 h). KRAS is situated downstream of EGFR [35] and cell-lines with KRAS mutation are less likely to respond to Erlotinib. The EGFR-KRAS predictor increased in performance and FLM scores were significantly better than using each of the individual data types.

Therefore, we have shown that FLM is a robust gene-level score which can be used to build better predictors of drug sensitivity compared to those developed on each data type alone. In addition, FLM scores are biologically meaningful and more importantly, the knowledge based information enriches the numeric data measurements.

Using the FLM activity scores to cluster colorectal cancers

In this section we present the utility of FLM activity scores for clustering cancers. FLM integrates multiple data types (somatic mutation, gene expression and copy number) to compute gain/loss of function scores for each gene. The feature space is reduced by 3-fold, therefore clustering cell lines using the FLM scores may enable a more accurate stratification of cancers compared to using each data type separately. Additionally, this also enables us to directly identify cancer drivers of each subtype.

Colorectal cancer is a heterogeneous and genetically complex disease with tumors bearing a high mutation load. To date, no gene expression signature was proven to be reliable for stratification in clinical practice. A recent study of 1100 colorectal cancer patients reported three main molecularly distinct subtypes (CCS1, CCS2 and CCS3) using gene expression data [36]. CCS1 was associated with MSI (micro-satellite instable), CIMP (CpG island methylator phenotype), CCS2 was associated with CIN (chromosomal instability) and KRAS/TP53 mutations, while CCS3 was a novel finding with poor prognosis and was associated with serrated adenomas.

We used FLM scores of the 1547 genes to stratify the colorectal cancer cell lines into clusters based on their activity scores. We ran consensus clustering [37, 38] on 42 colorectal cancer cell lines from the CCLE and identified 2 clusters (Fig. 5 a, b). We compared the FLM clusters with the previously obtained CCS clusters [36]. The heatmap (Fig. 5 c) shows the posterior probability of association with each of the CCS clusters. We can notice a clear overlap between the FLM cluster 2 (green) vs. CCS2. Most likely we do not see a difference between clusters 1 and 3 because of reduced sample size. Another reason may be that FLM identifies oncogenes and tumor suppressors which are active across multiple subtypes. Also, by using activity scores instead of gene expression data we may detect a lineage effect that drives CCS3.

Fig. 5
figure 5

Identifying unsupervized clusters in colorectal cancer and finding differential gene activity within each cluster. a Consensus matrix for K=2,3,4,5, using k-means clustering on colorectal cell lines. The consensus matrices show that there are two distinct subtypes which are stable even when K is increased. b Principal component analysis (PCA) plot of the FLM gene activity scores for 42 colorectal cancer cell lines. Colors indicate the two subtypes found using consensus clustering. c Subtypes found by FLM in CCLE are validated by comparing with subtypes in TCGA [36]. CCS2 is correlated with cluster 2 (green), while cluster 1 is split between CCS1 and CCS3. d Heatmap of the significantly differential gene activity scores (Student’s t-test, F D R<0.05) which differentiate the two FLM subtypes

Then, by using the gene activity scores, we identified statistically significantly differential active genes (Student’s t-test, F D R<0.05) which differentiated the two clusters (Fig. 5 d). We can directly identify cancer drivers from these results. Cluster 1 (red) has suppressed apoptosis (ACVR2A [39], ABL2 [40], PLK2 [41]), while Cluster 2 (green) has activated oncogenes like MDM2 [22] and FGFR3 [42]. We show that using FLM scores we can validate subtypes found in larger studies as well as identify cancer drivers in each of the subtypes.

Conclusions

We propose and validate a novel approach for integrating molecular data to infer patient-specific activity of oncogenes and tumor suppressor genes using fuzzy logic modeling. We show that gain of function or loss of function of a gene can be better characterized by integrating different molecular measurements with biological knowledge than by analyzing each data type separately.

The main advantage of the fuzzy logic framework is that the data level is separated from the inference level, therefore expert knowledge can be incorporated into the system, regardless of the data representation. Another important advantage is the generality of the FLM scores which is computed for each patient and each gene. Therefore, the FLM score can be used to classify genes into oncogenes and tumor suppressors, determine the level of activity for each oncogene and tumor suppressor, and moreover classify or cluster the patients.

FLM is a flexible method to incorporate measurements obtained from different platforms using knowledge based rules. Noise from each individual data type is reduced by categorizing the data, and by integrating multiple data types together. Samples can be stratified using an integrated activity score which better captures the state of each gene. Moreover, reducing the feature space by a factor of three enhances the statistical power of the sample stratification studies, which is especially helpful in data sets of limited sample size.

The novel FLM activity scores improves the prediction of drug response for five different compounds, compared to mutation status, gene expression and copy number data. In addition, we are also able to significantly improve prediction of BYL719 drug sensitivity using a combined predictor of PIK3CA GoF score and PTEN LoF score and for Erlotinib using a combined predictor of EGFR and KRAS GoF scores. Moreover, the performance of FLM scores for clasifying drug responders outperforms the numeric integration of the three different data types for AZD6244, PLX4720 and Erlotinib. Therefore, we conclude that the proposed integrative approach is biologically meaningful and more importantly, the knowledge based information enriches the numeric data measurements.

Another interesting observation revealed by our method is that oncogenic activity of a gene is characterized by both mutation status and its gene expression levels.

We also validate the status of known oncogenes and tumor suppressor genes. Our method highlights additional potential oncogenes and tumor suppressors, which will further be explored. The inference rules can be easily adapted to capture particular gene behaviors that are exceptions to the proposed general set of rules, as we have shown for TP53 case. We plan to further extend the rules and incorporate more expert knowledge, especially for the case of exceptions.

Moreover, we present an application of the FLM scores for sample stratification and subtype discovery. We use the FLM gene scores as clustering features of the colorectal cell lines. We are able to identify colorectal subtypes which were previously obtained in a larger sample size study. We also highlight potential gene drivers associated with each subtype.

The proposed activity scores can be utilized for the prediction of cells’ response to different perturbations such as drugs or shRNAs and may help in identifying cancer driver genes. The patient-specific gene-level scores can also be used for subtype discovery and stratification of the samples, as we have shown with the clustering of colon cancers. The FLM scores can generally be utilized to classify or cluster samples, enhancing the statistical power of such studies by reducing the feature space. Activated oncogenes and inactivated tumor suppressors can be identified within each group.

The proposed framework is flexible and can further be extended to incorporate other data measurements which may provide additional information about the gene’s state, such as the methylation status, the expression of microRNA regulators or transcription factors. In future work we also plan to further explore the FLM scores by studying other CCLE compounds with more complex mechanisms of action, such as multiple downstream or unknown gene targets. In addition, exploring different sensitivity levels and correlating them with FLM scores will be considered. Furthermore, we plan to extend this approach to predict the activity of molecular pathways. For example, the proposed FLM activity scores can be used as the nodes of a gene network. Similar approaches of pathway level integration were proposed in [9] and [15]. We aim to integrate the gene activity scores with pathway-level information to model gene interactions and evaluate the impact of the FLM scores on the activity of molecular pathways.

Availability of supporting data

The results published in this paper are based upon the publicly available data from the Cancer Cell Line Encyclopedia [6] (doi:10.1038/nature1100310.1038/nature11003): http://www.broadinstitute.org/ccle/home.

Abbreviations

AUC:

area under the receiver operating characteristic curve

CCLE:

cancer cell Line encyclopedia

CIN:

chromosomal instability

CRC:

colorectal cancer

FDR:

false discovery rate

FLM:

fuzzy logic modeling

GoF:

gain of function

ICGC:

international network of cancer genome projects

LoF:

loss of function

ONG:

oncogene

TCGA:

the cancer genome atlas

TSG:

tumor suppressor gene

References

  1. The Cancer Genome Atlas Research Network and others. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490(7418):61–70.

    Article  Google Scholar 

  2. Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Philos Trans R Soc Lond A Math Phys Eng Sci. 2009; 367(1906):4237–253. [doi:10.1098/rsta.2009.0159].

    Article  Google Scholar 

  3. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013; 339(6127):1546–58. [doi:10.1126/science.1235122].

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013; 45(10):1113–20.

    Article  PubMed Central  PubMed  Google Scholar 

  5. Hudson TJ, Anderson W, Aretz A, Barker AD, Bell C, Bernabé RR, et al. International network of cancer genome projects. Nature. 2010; 464(7291):993–8.

    Article  CAS  PubMed  Google Scholar 

  6. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012; 483(7391):603–7.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  7. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Sci Signa. 2013; 6(269):1–1. [doi:10.1126/scisignal.2004088].

    Article  Google Scholar 

  8. Bilal E, Dutkowski J, Guinney J, Jang IS, Logsdon BA, Pandey G, et al. Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Comput Biol. 2013; 9(5):1003047. [doi:10.1371/journal.pcbi.1003047].

    Article  Google Scholar 

  9. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm. Bioinformatics. 2010; 26(12):237–45. [doi:10.1093/bioinformatics/btq182].

    Article  Google Scholar 

  10. The Cancer Genome Atlas Research Network and others. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012; 489(7417):519–25.

    Article  Google Scholar 

  11. The Cancer Genome Atlas Research Network and others. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014; 511(7511):543–50.

    Article  Google Scholar 

  12. Sonkin D, Hassan M, Murphy DJ, Tatarinova TV. Tumor suppressors status in cancer cell line encyclopedia. Mol Oncol. 2013; 7(4):791–8. [doi:10.1016/j.molonc.2013.04.001].

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  13. Bertrand D, Chng KR, Sherbaf FG, Kiesel A, Chia BKH, Sia YY, et al. Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles. Nucl Acids Res. 2015. [doi:10.1093/nar/gku1393].

  14. Suo C, Hrydziuszko O, Lee D, Pramana S, Saputra D, Joshi H, et al. Integration of somatic mutation, expression and functional data reveals potential driver genes predictive of breast cancer survival. Bioinformatics. 2015; 31(16):2607–13.

    Article  PubMed  Google Scholar 

  15. Ng S, Collisson EA, Sokolov A, Goldstein T, Gonzalez-Perez A, Lopez-Bigas N, et al. Paradigm-shift predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics. 2012; 28(18):640–6. [doi:10.1093/bioinformatics/bts402].

    Article  Google Scholar 

  16. Schaefer G, Nakashima T, Yokota Y, Ishibuchi H. Fuzzy classification of gene expression data. In: IEEE International Fuzzy Systems Conference (FUZZ-IEEE): 2007. p. 1–6, doi:10.1109/FUZZY.2007.4295519.

  17. Huerta E, Duval B, Hao JK. Fuzzy logic for elimination of redundant information of microarray data. Genomics Proteomics Bioinforma. 2008; 6(2):61–73. [doi:10.1016/S1672-0229(08)60021-2].

    Article  CAS  Google Scholar 

  18. Brock GN, Pihur V, Kubatko L. Detecting gene regulatory networks from microarray data using fuzzy logic In: Jin Y, Wang L, editors. Fuzzy Systems in Bioinformatics and Computational Biology. Vol. 242 of the series Studies in Fuzziness and Soft Computing. Berlin, Germany: Springer: 2009. p. 141–163.

    Google Scholar 

  19. Xu D, Keller JM, Popescu M, Bondugula R. In: Applications of Fuzzy Logic in Bioinformatics. London, UK: Imperial College Press: 2008. 225 pages.

  20. Walker F, Abramowitz L, Benabderrahmane D, Duval X, Descatoire V, Hénin D, et al. Growth factor receptor expression in anal squamous lesions: modifications associated with oncogenic human papillomavirus and human immunodeficiency virus. 2009; 40(11):1517–27. [doi:10.1016/j.humpath.2009.05.010].

  21. Douillard J, Pirker R, O’Byrne K, Kerr K, Störkel S, von Heydebreck A, et al. Relationship between egfr expression, egfr mutation status, and the efficacy of chemotherapy plus cetuximab in flex study patients with advanced non-small-cell lung cancer. 2014; 9(5):717–24. [doi:10.1097/JTO.0000000000000141].

  22. Kubbutat MH, Jones SN, Vousden KH. Regulation of p53 stability by mdm2. Nature. 1997; 387(6630):299–303.

    Article  CAS  PubMed  Google Scholar 

  23. Ponder BA. Cancer genetics. Nature. 2001; 411(6835):336–41.

    Article  CAS  PubMed  Google Scholar 

  24. Hollstein M, Hainaut P. Massively regulated genes: the example of tp53. J Pathol. 2010; 220(2):164–73. [doi:10.1002/path.2637].

    CAS  PubMed  Google Scholar 

  25. Payne SR, Kemp CJ. Tumor suppressor genetics. Carcinogenesis. 2005; 26(12):2031–45. [doi:10.1093/carcin/bgi223].

    Article  CAS  PubMed  Google Scholar 

  26. Fritsch C, Huang A, Chatenay-Rivauday C, Schnell C, Reddy A, Liu M, et al. Characterization of the novel and specific pi3ka inhibitor nvp-byl719 and development of the patient stratification strategy for clinical trials. Mol Cancer Ther. 2014; 13(5):1117–29. [doi:10.1158/1535-7163.MCT-13-0865].

    Article  CAS  PubMed  Google Scholar 

  27. Vassilev LT, Vu BT, Graves B, Carvajal D, Podlaski F, Filipovic Z, et al. In vivo activation of the p53 pathway by small-molecule antagonists of mdm2. Science. 2004; 303(5659):844–8. [doi:10.1126/science.1092472].

    Article  CAS  PubMed  Google Scholar 

  28. Efeyan A, Ortega-Molina A, Velasco-Miguel S, Herranz D, Vassilev LT, Serrano M. Induction of p53-dependent senescence by the mdm2 antagonist nutlin-3a in mouse cells of fibroblast origin. Cancer Res. 2007; 67(15):7350–7. [doi:10.1158/0008-5472.CAN-07-0200].

    Article  CAS  PubMed  Google Scholar 

  29. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983; 148(3):839–43. [doi:10.1148/radiology.148.3.6878708].

    Article  CAS  PubMed  Google Scholar 

  30. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. proc: an open-source package for r and s+ to analyze and compare roc curves. BMC Bioinforma. 2011; 12(1):77. [doi:10.1186/1471-2105-12-77].

    Article  Google Scholar 

  31. Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15(R47). doi:10.1186/gb-2014-15-3-r47.

  32. Barlow WE, White E, Ballard-Barbash R, Vacek PM, Titus-Ernstoff L, Carney PA, et al. Prospective breast cancer risk prediction model for women undergoing screening mammography. JNCI J Natl Cancer Inst. 2006; 98(17):1204–14. [doi:10.1093/jnci/djj331].

    Article  PubMed  Google Scholar 

  33. Chhatwal J, Alagoz O, Lindstrom MJ, Kahn CE, Shaffer KA, Burnside ES. A logistic regression model based on the national mammography database format to aid breast cancer diagnosis. Am J Roentgenol. 2014; 192(4):1117–27. [doi:10.2214/AJR.07.3345].

    Article  Google Scholar 

  34. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. Panelomix: A threshold-based algorithm to create panels of biomarkers. Adv Integr Med. 2013; 1(1):57–64. [doi:10.1016/j.trprot.2013.04.003].

    CAS  Google Scholar 

  35. Berg M, Soreide K. Egfr and downstream genetic alterations in kras/braf and pi3k/akt pathways in colorectal cancer – implications for targeted therapy. Discov Med. 2012; 14(76):207–14.

    PubMed  Google Scholar 

  36. Felipe De Sousa EM, Wang X, Jansen M, Fessler E, Trinh A, de Rooij LP, et al. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nat Med. 2013; 19(5):614–8.

    Article  Google Scholar 

  37. Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003; 52(1–2):91–118. [doi:10.1023/A:1023949509487].

    Article  Google Scholar 

  38. Wilkerson MD, Hayes DN. Consensusclusterplus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010; 26(12):1572–3. [doi:10.1093/bioinformatics/btq170].

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  39. The Cancer Genome Atlas Research Network and others. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487(7407):330–7.

    Article  Google Scholar 

  40. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci. 2010; 107(43):18545–50. [doi:10.1073/pnas.1010978107].

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  41. Coley H, Hatzmichael E, Blagden S, McNeish I, Thompson A, Crook, T, et al. Polo like kinase 2 tumour suppressor and cancer biomarker: new perspectives on drug sensitivity/resistance in ovarian cancer. Oncotarget. 2012; 3(1):78–83.

    Article  PubMed Central  PubMed  Google Scholar 

  42. Chesi M, Brents LA, Ely SA, Bais C, Robbiani DF, Mesri EA, et al. Activated fibroblast growth factor receptor 3 is an oncogene that contributes to tumor progression in multiple myeloma. Blood. 2001; 97(3):729–36. [doi:10.1182/blood.V97.3.729].

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Joseph Lehar and Manway Liu for their valuable suggestions and ideas to this study. We acknowledge the Graduate Program in Bioinformatics at Boston University for support.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ana B. Pavel.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ABP contributed to the design, implementation and validation of the proposed approach, definition of the rules, drug sensitivity data analysis, and writing the manuscript. DS contributed to the definition of the rules and provided useful suggestions to the entire paper. AR contributed to the development of the proposed approach, drug sensitivity data analysis, CRC cell-lines analysis, and writing the manuscript. All authors have read and approved the final version of the manuscript.

Additional files

Additional file 1

An example of the defined membership functions and general fuzzy rules in FCL (Fuzzy Control Language) format. (FCL 8.15 kb)

Additional file 2

GoF and LoF gene scores for those genes that are mutated in more than 1 % of the samples. (CSV 37.9 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Pavel, A.B., Sonkin, D. & Reddy, A. Integrative modeling of multi-omics data to identify cancer drivers and infer patient-specific gene activity. BMC Syst Biol 10, 16 (2016). https://0-doi-org.brum.beds.ac.uk/10.1186/s12918-016-0260-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12918-016-0260-9

Keywords