Use Case 10: Reactome Pathways¶

Enrichment analysis helps to focus on a subset of genes that participate in the same pathway from a larger gene set. Visualizing the pathway can aid further examination. In this use case, we demonstrate how to visualize results on Reactome Pathways using python packages.

Step 1: Importing packages¶

Various Python packages are needed to download data from cptac, conduct statistical analysis to find meaningful protein clusters, and overlay pathway diagrams found on Reactome. The cptac.utils package offers a plethora of functions for these tasks. You can find more details by using help(cptac.utils).

In [1]:
import cptac
import pandas as pd
import cptac.utils as utils
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels as statsmodels
from IPython.display import Image

Step 2: Acquiring Data¶

Even after importing the cptac package, we still need to load specific datasets. Here, we load the Renal Cancer proteomics data.

In [2]:
ds = cptac.Ccrcc()
df = ds.get_proteomics('umich')
df.head()
cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\Users\sabme\anaconda3\lib\threading.py, line 910)
Out[2]:
Name ARF5 M6PR ESRRA FKBP4 NDUFAF7 FUCA2 DBNDD1 CFTR CYP51A1 USP28 ... ANK2 ATXN2 ETNK1 MYO6 GBF1 CTNND1 PRX WIZ HSPA12A
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000002501.6 ENSP00000003084.6 ENSP00000003100.8 ENSP00000003302.4 ... ENSP00000500245.1 ENSP00000500275.1 ENSP00000500313.1 ENSP00000500633.1 ENSP00000500710.1 ENSP00000501064.1 ENSP00000501126.1 ENSP00000501261.1 ENSP00000501300.1 ENSP00000501491.1
Patient_ID
C3L-00004 0.248796 0.308289 -0.315825 0.051528 -0.323475 -0.720420 NaN -0.832085 -0.629139 -0.317251 ... 0.359093 1.609902 0.054346 -0.471825 -0.095824 0.240579 0.689392 NaN -0.072279 -0.655139
C3L-00010 0.245087 0.131992 -0.016811 -0.114728 -0.816992 0.104260 NaN 1.266034 -0.573329 -0.172712 ... -1.592004 -0.542294 NaN -0.208478 -0.143536 0.049889 0.499405 1.422138 -0.058976 -0.189293
C3L-00011 0.576821 0.443678 -0.544452 0.120776 -0.075035 -0.738826 NaN -0.484732 -0.725274 1.079570 ... -0.523301 -0.389680 NaN 0.550085 -1.617542 0.485815 -0.327129 NaN 0.344999 -0.688779
C3L-00026 0.293042 0.250350 0.142924 -0.173812 0.589458 0.238597 0.957386 -3.178976 0.372489 0.404721 ... -0.249875 -1.492455 NaN 0.005406 -0.281776 0.057764 -0.073197 NaN -0.159555 -0.023920
C3L-00079 0.174315 0.026805 -0.663191 0.075943 -0.240099 -0.293380 NaN 0.008966 -0.469829 NaN ... -1.423105 -0.921139 NaN -0.042932 -0.921862 0.385364 NaN NaN 0.387207 -1.057636

5 rows × 11889 columns

Utilizing utils, we can pull data from a range of large databases. In this example, the Hugo Gene Nomenclature database is queried for a list of proteins in the proteosome.

In [3]:
hgnc_lists = utils.get_hgnc_protein_lists()
proteasome_proteins = sorted(set(hgnc_lists["Proteasome"]))
included_cols = df.columns.get_level_values(0).intersection(proteasome_proteins)
included_cols
Out[3]:
Index(['PSMA4', 'PSMC4', 'PSME1', 'PSMD5', 'PSMD8', 'PSMA3', 'PSME2', 'PSMD10',
       'PSMD7', 'PSMF1', 'ADRM1', 'PSMB7', 'PSMC1', 'PSMA6', 'PSMD11', 'PSMB1',
       'PSMD3', 'PSMB6', 'PSMA5', 'PSMB4', 'PSMC2', 'PSME3', 'PSMD6', 'PSMC3',
       'PSMD1', 'PSMD2', 'PSMC5', 'PSMD12', 'PSMB10', 'PSMB5', 'PSMD4',
       'PSMA7', 'PSMB2', 'PSMB9', 'PSMB8', 'PSMA1', 'PSME4', 'PSMD14', 'PSMC6',
       'PSMD13', 'PSMD9', 'PSMB3'],
      dtype='object', name='Name')

Step 3: Identifying Differences between Tumor and Normal Conditions¶

We loop through the list of proteins, separating tumor and normal data from each other. A t-test is then performed on both data lists. For each protein with a p-value less than .05, the protein name, p-value, and the difference between tumor mean and normal mean is stored in separate lists.

In [4]:
nameList =[]
pVal_list = []
deltaList = []
selected_proteins = df[included_cols]

for protein in selected_proteins.columns:
    data = selected_proteins[protein]
    normal = data[data.index.str.endswith('.N')]
    tumor = data[~data.index.str.endswith('.N')]
    pVal = stats.ttest_ind(normal, tumor, equal_var = False, nan_policy='omit').pvalue 
    
    if pVal < .05:
        name = protein[0]
        delta = tumor.mean() - normal.mean()
        nameList.append(name)
        pVal_list.append(pVal)
        deltaList.append(delta)

We apply the Benjamini & Hochberg/Yekutieli correction to the list of p-values calculated in the previous cell. A dataframe is then constructed using the lists of names, p-values, and differences between tumor mean and normal mean.

In [5]:
reject_null, adj_p, alpha_sidak, alpha_bonf = statsmodels.stats.multitest.multipletests(pVal_list, alpha=0.05, method='fdr_bh')
pVal_list = adj_p
zippedList =  list(zip(nameList, pVal_list, deltaList))
dfObj = pd.DataFrame(zippedList, columns =  ['Name', 'P-Value', 'Delta'])
dfObj = dfObj.set_index('Name')

We then simplify the abundance change (Delta column). An increase is assigned 1, a decrease is assigned -1, and no change is set at 0.

In [6]:
dfObj.insert(1, "simplified_change", dfObj["Delta"])
dfObj["simplified_change"][dfObj["Delta"] > 0] = 1
dfObj["simplified_change"][dfObj["Delta"] < 0] = -1
dfObj["simplified_change"][dfObj["Delta"] == 0] = 0

Using the Benjamini & Hochberg/Yekutieli correction function we are able to also evaluate whether or not to reject the null. This is evaluated and added as a column in the dataframe.

In [7]:
dfObj = dfObj.assign(adj_p=adj_p, reject_null=reject_null)
sig_res = dfObj[dfObj["reject_null"]]
sig_genes = dfObj.index.get_level_values(0).tolist()
sig_res
Out[7]:
P-Value simplified_change Delta adj_p reject_null
Name
PSMA4 1.911611e-10 1.0 0.148198 1.911611e-10 True
PSMC4 1.391607e-09 1.0 0.112355 1.391607e-09 True
PSME1 2.008155e-13 1.0 0.316319 2.008155e-13 True
PSMD5 2.182605e-11 1.0 0.217871 2.182605e-11 True
PSMD8 7.052978e-09 1.0 0.138417 7.052978e-09 True
PSMA3 1.075243e-12 1.0 0.177574 1.075243e-12 True
PSME2 8.868141e-36 1.0 0.671231 8.868141e-36 True
PSMD10 1.370257e-56 1.0 0.468575 1.370257e-56 True
PSMD7 1.926944e-05 1.0 0.080727 1.926944e-05 True
PSMF1 2.042107e-05 1.0 0.184903 2.042107e-05 True
PSMF1 1.563592e-27 1.0 0.381929 1.563592e-27 True
ADRM1 4.086117e-03 1.0 0.068020 4.086117e-03 True
PSMB7 1.087052e-08 -1.0 -0.263777 1.087052e-08 True
PSMC1 2.837536e-18 1.0 0.169574 2.837536e-18 True
PSMA6 4.235036e-24 1.0 0.277224 4.235036e-24 True
PSMD11 4.601922e-04 1.0 0.064925 4.601922e-04 True
PSMB1 1.976143e-29 1.0 0.278219 1.976143e-29 True
PSMD3 2.011238e-12 1.0 0.141735 2.011238e-12 True
PSMB6 5.243052e-27 -1.0 -0.716784 5.243052e-27 True
PSMA5 2.114399e-35 1.0 0.302883 2.114399e-35 True
PSMB4 1.921437e-21 1.0 0.255068 1.921437e-21 True
PSMC2 4.943515e-10 1.0 0.109437 4.943515e-10 True
PSME3 8.587448e-06 -1.0 -0.129204 8.587448e-06 True
PSMD6 9.883283e-13 1.0 0.130206 9.883283e-13 True
PSMC3 3.781102e-15 1.0 0.139832 3.781102e-15 True
PSMD1 1.100341e-05 1.0 0.082433 1.100341e-05 True
PSMD1 2.889130e-08 1.0 0.509608 2.889130e-08 True
PSMD2 1.523451e-03 1.0 0.060321 1.523451e-03 True
PSMC5 3.164878e-16 1.0 0.155371 3.164878e-16 True
PSMD12 1.703326e-37 1.0 0.334370 1.703326e-37 True
PSMB10 4.819896e-58 1.0 1.266577 4.819896e-58 True
PSMB5 2.031051e-40 -1.0 -1.015462 2.031051e-40 True
PSMD4 2.790907e-03 -1.0 -0.090527 2.790907e-03 True
PSMD4 3.220494e-09 1.0 0.118229 3.220494e-09 True
PSMA7 1.114386e-25 1.0 0.235643 1.114386e-25 True
PSMB2 2.936813e-20 1.0 0.225433 2.936813e-20 True
PSMB9 1.075926e-75 1.0 1.145988 1.075926e-75 True
PSMB8 1.082528e-66 1.0 0.877224 1.082528e-66 True
PSMA1 1.743304e-25 1.0 0.258769 1.743304e-25 True
PSME4 1.536922e-02 1.0 0.086871 1.536922e-02 True
PSMD14 1.004902e-11 1.0 0.125789 1.004902e-11 True
PSMC6 8.132824e-13 1.0 0.118492 8.132824e-13 True
PSMD13 9.849506e-10 1.0 0.127654 9.849506e-10 True
PSMD9 2.760906e-02 1.0 0.065384 2.760906e-02 True
PSMB3 1.100460e-03 1.0 0.092246 1.100460e-03 True

Step 4: Display data with Reactome¶

Within utils from the cptac package we can use a function called "reactome_pathway_overlay" to query reactome and overlay the proteosome data we have found to be significant. A detailed guide on using this function can be read by calling help(utils.reactome_pathway_overlay). In this example I have chosen to overlay my data with the R-HSA-1236978 pathway ID. Determining desired pathway ID's in relation to your specific data can be done in a variety of methods such as gene enrichment analysis. Finally, with the quality parameter set at 10 you can zoom in on areas of the pathway to see further detail.

In [8]:
expression_vals, image_path = utils.reactome_pathway_overlay(
    'R-HSA-1236978', 
    df=dfObj['simplified_change'], 
    analysis_token=None, 
    open_browser=False,
    export_path="test.png", 
    image_format='png', 
    display_col_idx=None,                      
    diagram_colors='Modern', 
    overlay_colors='Standard', 
    quality=10)
Image(image_path)
Out[8]:
No description has been provided for this image