Use Case 10: Reactome Pathways¶
Enrichment analysis helps to focus on a subset of genes that participate in the same pathway from a larger gene set. Visualizing the pathway can aid further examination. In this use case, we demonstrate how to visualize results on Reactome Pathways using python packages.
Step 1: Importing packages¶
Various Python packages are needed to download data from cptac, conduct statistical analysis to find meaningful protein clusters, and overlay pathway diagrams found on Reactome. The cptac.utils package offers a plethora of functions for these tasks. You can find more details by using help(cptac.utils).
import cptac
import pandas as pd
import cptac.utils as utils
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels as statsmodels
from IPython.display import Image
Step 2: Acquiring Data¶
Even after importing the cptac package, we still need to load specific datasets. Here, we load the Renal Cancer proteomics data.
ds = cptac.Ccrcc()
df = ds.get_proteomics('umich')
df.head()
cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\Users\sabme\anaconda3\lib\threading.py, line 910)
Name | ARF5 | M6PR | ESRRA | FKBP4 | NDUFAF7 | FUCA2 | DBNDD1 | CFTR | CYP51A1 | USP28 | ... | ANK2 | ATXN2 | ETNK1 | MYO6 | GBF1 | CTNND1 | PRX | WIZ | HSPA12A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSP00000000412.3 | ENSP00000000442.6 | ENSP00000001008.4 | ENSP00000002125.4 | ENSP00000002165.5 | ENSP00000002501.6 | ENSP00000003084.6 | ENSP00000003100.8 | ENSP00000003302.4 | ... | ENSP00000500245.1 | ENSP00000500275.1 | ENSP00000500313.1 | ENSP00000500633.1 | ENSP00000500710.1 | ENSP00000501064.1 | ENSP00000501126.1 | ENSP00000501261.1 | ENSP00000501300.1 | ENSP00000501491.1 |
Patient_ID | |||||||||||||||||||||
C3L-00004 | 0.248796 | 0.308289 | -0.315825 | 0.051528 | -0.323475 | -0.720420 | NaN | -0.832085 | -0.629139 | -0.317251 | ... | 0.359093 | 1.609902 | 0.054346 | -0.471825 | -0.095824 | 0.240579 | 0.689392 | NaN | -0.072279 | -0.655139 |
C3L-00010 | 0.245087 | 0.131992 | -0.016811 | -0.114728 | -0.816992 | 0.104260 | NaN | 1.266034 | -0.573329 | -0.172712 | ... | -1.592004 | -0.542294 | NaN | -0.208478 | -0.143536 | 0.049889 | 0.499405 | 1.422138 | -0.058976 | -0.189293 |
C3L-00011 | 0.576821 | 0.443678 | -0.544452 | 0.120776 | -0.075035 | -0.738826 | NaN | -0.484732 | -0.725274 | 1.079570 | ... | -0.523301 | -0.389680 | NaN | 0.550085 | -1.617542 | 0.485815 | -0.327129 | NaN | 0.344999 | -0.688779 |
C3L-00026 | 0.293042 | 0.250350 | 0.142924 | -0.173812 | 0.589458 | 0.238597 | 0.957386 | -3.178976 | 0.372489 | 0.404721 | ... | -0.249875 | -1.492455 | NaN | 0.005406 | -0.281776 | 0.057764 | -0.073197 | NaN | -0.159555 | -0.023920 |
C3L-00079 | 0.174315 | 0.026805 | -0.663191 | 0.075943 | -0.240099 | -0.293380 | NaN | 0.008966 | -0.469829 | NaN | ... | -1.423105 | -0.921139 | NaN | -0.042932 | -0.921862 | 0.385364 | NaN | NaN | 0.387207 | -1.057636 |
5 rows × 11889 columns
Utilizing utils, we can pull data from a range of large databases. In this example, the Hugo Gene Nomenclature database is queried for a list of proteins in the proteosome.
hgnc_lists = utils.get_hgnc_protein_lists()
proteasome_proteins = sorted(set(hgnc_lists["Proteasome"]))
included_cols = df.columns.get_level_values(0).intersection(proteasome_proteins)
included_cols
Index(['PSMA4', 'PSMC4', 'PSME1', 'PSMD5', 'PSMD8', 'PSMA3', 'PSME2', 'PSMD10', 'PSMD7', 'PSMF1', 'ADRM1', 'PSMB7', 'PSMC1', 'PSMA6', 'PSMD11', 'PSMB1', 'PSMD3', 'PSMB6', 'PSMA5', 'PSMB4', 'PSMC2', 'PSME3', 'PSMD6', 'PSMC3', 'PSMD1', 'PSMD2', 'PSMC5', 'PSMD12', 'PSMB10', 'PSMB5', 'PSMD4', 'PSMA7', 'PSMB2', 'PSMB9', 'PSMB8', 'PSMA1', 'PSME4', 'PSMD14', 'PSMC6', 'PSMD13', 'PSMD9', 'PSMB3'], dtype='object', name='Name')
Step 3: Identifying Differences between Tumor and Normal Conditions¶
We loop through the list of proteins, separating tumor and normal data from each other. A t-test is then performed on both data lists. For each protein with a p-value less than .05, the protein name, p-value, and the difference between tumor mean and normal mean is stored in separate lists.
nameList =[]
pVal_list = []
deltaList = []
selected_proteins = df[included_cols]
for protein in selected_proteins.columns:
data = selected_proteins[protein]
normal = data[data.index.str.endswith('.N')]
tumor = data[~data.index.str.endswith('.N')]
pVal = stats.ttest_ind(normal, tumor, equal_var = False, nan_policy='omit').pvalue
if pVal < .05:
name = protein[0]
delta = tumor.mean() - normal.mean()
nameList.append(name)
pVal_list.append(pVal)
deltaList.append(delta)
We apply the Benjamini & Hochberg/Yekutieli correction to the list of p-values calculated in the previous cell. A dataframe is then constructed using the lists of names, p-values, and differences between tumor mean and normal mean.
reject_null, adj_p, alpha_sidak, alpha_bonf = statsmodels.stats.multitest.multipletests(pVal_list, alpha=0.05, method='fdr_bh')
pVal_list = adj_p
zippedList = list(zip(nameList, pVal_list, deltaList))
dfObj = pd.DataFrame(zippedList, columns = ['Name', 'P-Value', 'Delta'])
dfObj = dfObj.set_index('Name')
We then simplify the abundance change (Delta column). An increase is assigned 1, a decrease is assigned -1, and no change is set at 0.
dfObj.insert(1, "simplified_change", dfObj["Delta"])
dfObj["simplified_change"][dfObj["Delta"] > 0] = 1
dfObj["simplified_change"][dfObj["Delta"] < 0] = -1
dfObj["simplified_change"][dfObj["Delta"] == 0] = 0
Using the Benjamini & Hochberg/Yekutieli correction function we are able to also evaluate whether or not to reject the null. This is evaluated and added as a column in the dataframe.
dfObj = dfObj.assign(adj_p=adj_p, reject_null=reject_null)
sig_res = dfObj[dfObj["reject_null"]]
sig_genes = dfObj.index.get_level_values(0).tolist()
sig_res
P-Value | simplified_change | Delta | adj_p | reject_null | |
---|---|---|---|---|---|
Name | |||||
PSMA4 | 1.911611e-10 | 1.0 | 0.148198 | 1.911611e-10 | True |
PSMC4 | 1.391607e-09 | 1.0 | 0.112355 | 1.391607e-09 | True |
PSME1 | 2.008155e-13 | 1.0 | 0.316319 | 2.008155e-13 | True |
PSMD5 | 2.182605e-11 | 1.0 | 0.217871 | 2.182605e-11 | True |
PSMD8 | 7.052978e-09 | 1.0 | 0.138417 | 7.052978e-09 | True |
PSMA3 | 1.075243e-12 | 1.0 | 0.177574 | 1.075243e-12 | True |
PSME2 | 8.868141e-36 | 1.0 | 0.671231 | 8.868141e-36 | True |
PSMD10 | 1.370257e-56 | 1.0 | 0.468575 | 1.370257e-56 | True |
PSMD7 | 1.926944e-05 | 1.0 | 0.080727 | 1.926944e-05 | True |
PSMF1 | 2.042107e-05 | 1.0 | 0.184903 | 2.042107e-05 | True |
PSMF1 | 1.563592e-27 | 1.0 | 0.381929 | 1.563592e-27 | True |
ADRM1 | 4.086117e-03 | 1.0 | 0.068020 | 4.086117e-03 | True |
PSMB7 | 1.087052e-08 | -1.0 | -0.263777 | 1.087052e-08 | True |
PSMC1 | 2.837536e-18 | 1.0 | 0.169574 | 2.837536e-18 | True |
PSMA6 | 4.235036e-24 | 1.0 | 0.277224 | 4.235036e-24 | True |
PSMD11 | 4.601922e-04 | 1.0 | 0.064925 | 4.601922e-04 | True |
PSMB1 | 1.976143e-29 | 1.0 | 0.278219 | 1.976143e-29 | True |
PSMD3 | 2.011238e-12 | 1.0 | 0.141735 | 2.011238e-12 | True |
PSMB6 | 5.243052e-27 | -1.0 | -0.716784 | 5.243052e-27 | True |
PSMA5 | 2.114399e-35 | 1.0 | 0.302883 | 2.114399e-35 | True |
PSMB4 | 1.921437e-21 | 1.0 | 0.255068 | 1.921437e-21 | True |
PSMC2 | 4.943515e-10 | 1.0 | 0.109437 | 4.943515e-10 | True |
PSME3 | 8.587448e-06 | -1.0 | -0.129204 | 8.587448e-06 | True |
PSMD6 | 9.883283e-13 | 1.0 | 0.130206 | 9.883283e-13 | True |
PSMC3 | 3.781102e-15 | 1.0 | 0.139832 | 3.781102e-15 | True |
PSMD1 | 1.100341e-05 | 1.0 | 0.082433 | 1.100341e-05 | True |
PSMD1 | 2.889130e-08 | 1.0 | 0.509608 | 2.889130e-08 | True |
PSMD2 | 1.523451e-03 | 1.0 | 0.060321 | 1.523451e-03 | True |
PSMC5 | 3.164878e-16 | 1.0 | 0.155371 | 3.164878e-16 | True |
PSMD12 | 1.703326e-37 | 1.0 | 0.334370 | 1.703326e-37 | True |
PSMB10 | 4.819896e-58 | 1.0 | 1.266577 | 4.819896e-58 | True |
PSMB5 | 2.031051e-40 | -1.0 | -1.015462 | 2.031051e-40 | True |
PSMD4 | 2.790907e-03 | -1.0 | -0.090527 | 2.790907e-03 | True |
PSMD4 | 3.220494e-09 | 1.0 | 0.118229 | 3.220494e-09 | True |
PSMA7 | 1.114386e-25 | 1.0 | 0.235643 | 1.114386e-25 | True |
PSMB2 | 2.936813e-20 | 1.0 | 0.225433 | 2.936813e-20 | True |
PSMB9 | 1.075926e-75 | 1.0 | 1.145988 | 1.075926e-75 | True |
PSMB8 | 1.082528e-66 | 1.0 | 0.877224 | 1.082528e-66 | True |
PSMA1 | 1.743304e-25 | 1.0 | 0.258769 | 1.743304e-25 | True |
PSME4 | 1.536922e-02 | 1.0 | 0.086871 | 1.536922e-02 | True |
PSMD14 | 1.004902e-11 | 1.0 | 0.125789 | 1.004902e-11 | True |
PSMC6 | 8.132824e-13 | 1.0 | 0.118492 | 8.132824e-13 | True |
PSMD13 | 9.849506e-10 | 1.0 | 0.127654 | 9.849506e-10 | True |
PSMD9 | 2.760906e-02 | 1.0 | 0.065384 | 2.760906e-02 | True |
PSMB3 | 1.100460e-03 | 1.0 | 0.092246 | 1.100460e-03 | True |
Step 4: Display data with Reactome¶
Within utils from the cptac package we can use a function called "reactome_pathway_overlay" to query reactome and overlay the proteosome data we have found to be significant. A detailed guide on using this function can be read by calling help(utils.reactome_pathway_overlay)
.
In this example I have chosen to overlay my data with the R-HSA-1236978 pathway ID. Determining desired pathway ID's in relation to your specific data can be done in a variety of methods such as gene enrichment analysis.
Finally, with the quality parameter set at 10 you can zoom in on areas of the pathway to see further detail.
expression_vals, image_path = utils.reactome_pathway_overlay(
'R-HSA-1236978',
df=dfObj['simplified_change'],
analysis_token=None,
open_browser=False,
export_path="test.png",
image_format='png',
display_col_idx=None,
diagram_colors='Modern',
overlay_colors='Standard',
quality=10)
Image(image_path)