Introduction to CASCAM package
instruction.Rmd
This package is written around a S4 object CASCAM
, where
the analysis results including the figures are stored. For now, our
framework is only limited to the bulk RNA-Seq data, and it is a
three-step framework which should be run sequentially.
- Step 1: Data preparation
- Step 2: Genome-wide cancer model pre-selection
- Step 3: Pathway specific exploration
In this tutorial, we use breast cancer (BRCA) analysis as an example. There are two breast cancer histology subtypes – invasive lobular carcinoma (ILC) and invasive (infiltrating) ductal carcinoma (IDC), and ILC is of interest in this study.
Data preparation
In the data preparation step, the following data is necessary –
-
tumor_log_tpm
: A data set of tumor gene expression on log scale (log2-TPM preferred). It is highly recommend to merge the interested tumor samples with the pan-cancer dataset (such as TCGA) to benefit from the large sample size, which has shown the outperformance comparing with the small sample size. -
tumor_label
: A vector of sample labels of the tumor log tpm data (binary class, interested v.s. uninterested). -
camod_log_tpm
: A data set of cancer model gene expression on log scale (log2-TPM preferred). Similarly, better performance can also come from the larger sample size. -
tumor_ct
: A data set of the un-normalized (estimated) counts of sequencing reads or fragments from the tumor data. -
tumor_label2
: A vector of sample labels of the tumor count data (binary class, interested v.s. uninterested).
Tumor and cancer model alignment
Celligner is applied in our framework to allow the tumors and cancer models directly comparable. Users can refer to this paper for more information. To simplify the process, we use the default parameters setting in Celligner paper. This step may take some time (~ 30 min).
aligned <- Celligner(tumor_log_tpm, camod_log_tpm)
Celligner
returns a Seurat object with the aligned data
included, and the aligned data for tumor and cancer models of interest
(tumor_aligned
and camod_aligned
) can be
extracted.
Differential expression analysis
Informative genes are then detected by using the tumor count matrix.
DESeq2
and fgsea
are implied in our framework
to identify the differential expression genes and the enriched pathways
(Hallmark and KEGG).
gene_info <- create_InformativeGenes(tumor_ct, tumor_label2, "ILC")
Create CASCAM object
A CASCAM
is then created for the downstream
analysis.
CASCAM_eg <- create_CASCAM(tumor_aligned, tumor_label, camod_aligned, gene_info)
Genome-wide cancer model pre-selection
Sparse discriminant analysis (SDA) model trainining
The first step in this section is to train the SDA model for (1) high dimensionality; (2) data sparsity; (3) classification assignment probabilities.
CASCAM_eg <- sda_model(CASCAM_eg)
By default, 1/20 of the genes are selected for the model, and the L2-norm parameter is pre-set to 1e-4. Users can provide their own choics or train the model through cross-validation. Parallel computation is available in our cross-validation function.
CASCAM_eg <- sda_model_cv(CASCAM_eg)
Starting from here, users can store the CASCAM_eg
object
into an RData file and call run_CASCAM_app()
for
interactive exploration. The next step is to calculate the
genome-related measurements.
CASCAM_eg <- genome_selection(CASCAM_eg)
CASCAM_eg <- genome_selection_visualize(CASCAM_eg)
Users can use SDA projected positions
CASCAM_eg@genome_figure$sda_project_position
and the
confidence intervals CASCAM_eg@genome_figure$sda_ds_ci_rank
to evaluate the cancer models visually.
Pathway specific exploration
Pathway specific exploration is performed based on genome-wide pre-selected cancer models. Pathway-related measurements are at first calculated.
CASCAM_eg <- pathway_analysis(CASCAM_eg)
The downstream exploration then goes as follows –
-
pathway_congruence_heatmap
: Explore the performance of selected cancer models across all the available pathways using pathway congruence heatmap. -
pathway_specific_heatmap
: Select the interested pathway and analyze the performance of selected cancer models across the genes within that pathway using pathway specific heatmap. -
pathway_specific_ridgeline
: Determine the subset of selected cancer models to visualize their positions for each gene within that pathway using pathway specific ridgeline . -
pathview_analysis
: Choose one cancer model and visualize its topological structure for final decision (if the pathway is from KEGG) using pathview.
It is highly recommended to performance the exploration using the Shiny app as it provides better experience of iterative exploration.