Use Case 10: Reactome Pathways¶

Enrichment analysis helps to focus on a subset of genes that participate in the same pathway from a larger gene set. Visualizing the pathway can aid further examination. In this use case, we demonstrate how to visualize results on Reactome Pathways using python packages.

Step 1: Importing packages¶

Various Python packages are needed to download data from cptac, conduct statistical analysis to find meaningful protein clusters, and overlay pathway diagrams found on Reactome. The cptac.utils package offers a plethora of functions for these tasks. You can find more details by using help(cptac.utils).

In [1]:

import cptac
import pandas as pd
import cptac.utils as utils
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels as statsmodels
from IPython.display import Image

Step 2: Acquiring Data¶

Even after importing the cptac package, we still need to load specific datasets. Here, we load the Renal Cancer proteomics data.

In [2]:

ds = cptac.Ccrcc()
df = ds.get_proteomics('umich')
df.head()

cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\Users\sabme\anaconda3\lib\threading.py, line 910)

Out[2]:

Name	ARF5	M6PR	ESRRA	FKBP4	NDUFAF7	FUCA2	DBNDD1	CFTR	CYP51A1	USP28	...	ANK2		ATXN2	ETNK1	MYO6	GBF1	CTNND1	PRX	WIZ	HSPA12A
Database_ID	ENSP00000000233.5	ENSP00000000412.3	ENSP00000000442.6	ENSP00000001008.4	ENSP00000002125.4	ENSP00000002165.5	ENSP00000002501.6	ENSP00000003084.6	ENSP00000003100.8	ENSP00000003302.4	...	ENSP00000500245.1	ENSP00000500275.1	ENSP00000500313.1	ENSP00000500633.1	ENSP00000500710.1	ENSP00000501064.1	ENSP00000501126.1	ENSP00000501261.1	ENSP00000501300.1	ENSP00000501491.1
Patient_ID
C3L-00004	0.248796	0.308289	-0.315825	0.051528	-0.323475	-0.720420	NaN	-0.832085	-0.629139	-0.317251	...	0.359093	1.609902	0.054346	-0.471825	-0.095824	0.240579	0.689392	NaN	-0.072279	-0.655139
C3L-00010	0.245087	0.131992	-0.016811	-0.114728	-0.816992	0.104260	NaN	1.266034	-0.573329	-0.172712	...	-1.592004	-0.542294	NaN	-0.208478	-0.143536	0.049889	0.499405	1.422138	-0.058976	-0.189293
C3L-00011	0.576821	0.443678	-0.544452	0.120776	-0.075035	-0.738826	NaN	-0.484732	-0.725274	1.079570	...	-0.523301	-0.389680	NaN	0.550085	-1.617542	0.485815	-0.327129	NaN	0.344999	-0.688779
C3L-00026	0.293042	0.250350	0.142924	-0.173812	0.589458	0.238597	0.957386	-3.178976	0.372489	0.404721	...	-0.249875	-1.492455	NaN	0.005406	-0.281776	0.057764	-0.073197	NaN	-0.159555	-0.023920
C3L-00079	0.174315	0.026805	-0.663191	0.075943	-0.240099	-0.293380	NaN	0.008966	-0.469829	NaN	...	-1.423105	-0.921139	NaN	-0.042932	-0.921862	0.385364	NaN	NaN	0.387207	-1.057636

5 rows × 11889 columns

Utilizing utils, we can pull data from a range of large databases. In this example, the Hugo Gene Nomenclature database is queried for a list of proteins in the proteosome.

In [3]:

hgnc_lists = utils.get_hgnc_protein_lists()
proteasome_proteins = sorted(set(hgnc_lists["Proteasome"]))
included_cols = df.columns.get_level_values(0).intersection(proteasome_proteins)
included_cols

Out[3]:

Index(['PSMA4', 'PSMC4', 'PSME1', 'PSMD5', 'PSMD8', 'PSMA3', 'PSME2', 'PSMD10',
       'PSMD7', 'PSMF1', 'ADRM1', 'PSMB7', 'PSMC1', 'PSMA6', 'PSMD11', 'PSMB1',
       'PSMD3', 'PSMB6', 'PSMA5', 'PSMB4', 'PSMC2', 'PSME3', 'PSMD6', 'PSMC3',
       'PSMD1', 'PSMD2', 'PSMC5', 'PSMD12', 'PSMB10', 'PSMB5', 'PSMD4',
       'PSMA7', 'PSMB2', 'PSMB9', 'PSMB8', 'PSMA1', 'PSME4', 'PSMD14', 'PSMC6',
       'PSMD13', 'PSMD9', 'PSMB3'],
      dtype='object', name='Name')

Step 3: Identifying Differences between Tumor and Normal Conditions¶

We loop through the list of proteins, separating tumor and normal data from each other. A t-test is then performed on both data lists. For each protein with a p-value less than .05, the protein name, p-value, and the difference between tumor mean and normal mean is stored in separate lists.

In [4]:

nameList =[]
pVal_list = []
deltaList = []
selected_proteins = df[included_cols]

for protein in selected_proteins.columns:
    data = selected_proteins[protein]
    normal = data[data.index.str.endswith('.N')]
    tumor = data[~data.index.str.endswith('.N')]
    pVal = stats.ttest_ind(normal, tumor, equal_var = False, nan_policy='omit').pvalue 
    
    if pVal < .05:
        name = protein[0]
        delta = tumor.mean() - normal.mean()
        nameList.append(name)
        pVal_list.append(pVal)
        deltaList.append(delta)

We apply the Benjamini & Hochberg/Yekutieli correction to the list of p-values calculated in the previous cell. A dataframe is then constructed using the lists of names, p-values, and differences between tumor mean and normal mean.

In [5]:

reject_null, adj_p, alpha_sidak, alpha_bonf = statsmodels.stats.multitest.multipletests(pVal_list, alpha=0.05, method='fdr_bh')
pVal_list = adj_p
zippedList =  list(zip(nameList, pVal_list, deltaList))
dfObj = pd.DataFrame(zippedList, columns =  ['Name', 'P-Value', 'Delta'])
dfObj = dfObj.set_index('Name')

We then simplify the abundance change (Delta column). An increase is assigned 1, a decrease is assigned -1, and no change is set at 0.

In [6]:

dfObj.insert(1, "simplified_change", dfObj["Delta"])
dfObj["simplified_change"][dfObj["Delta"] > 0] = 1
dfObj["simplified_change"][dfObj["Delta"] < 0] = -1
dfObj["simplified_change"][dfObj["Delta"] == 0] = 0

Using the Benjamini & Hochberg/Yekutieli correction function we are able to also evaluate whether or not to reject the null. This is evaluated and added as a column in the dataframe.

In [7]:

dfObj = dfObj.assign(adj_p=adj_p, reject_null=reject_null)
sig_res = dfObj[dfObj["reject_null"]]
sig_genes = dfObj.index.get_level_values(0).tolist()
sig_res

Out[7]:

	P-Value	simplified_change	Delta	adj_p	reject_null
Name
PSMA4	1.911611e-10	1.0	0.148198	1.911611e-10	True
PSMC4	1.391607e-09	1.0	0.112355	1.391607e-09	True
PSME1	2.008155e-13	1.0	0.316319	2.008155e-13	True
PSMD5	2.182605e-11	1.0	0.217871	2.182605e-11	True
PSMD8	7.052978e-09	1.0	0.138417	7.052978e-09	True
PSMA3	1.075243e-12	1.0	0.177574	1.075243e-12	True
PSME2	8.868141e-36	1.0	0.671231	8.868141e-36	True
PSMD10	1.370257e-56	1.0	0.468575	1.370257e-56	True
PSMD7	1.926944e-05	1.0	0.080727	1.926944e-05	True
PSMF1	2.042107e-05	1.0	0.184903	2.042107e-05	True
PSMF1	1.563592e-27	1.0	0.381929	1.563592e-27	True
ADRM1	4.086117e-03	1.0	0.068020	4.086117e-03	True
PSMB7	1.087052e-08	-1.0	-0.263777	1.087052e-08	True
PSMC1	2.837536e-18	1.0	0.169574	2.837536e-18	True
PSMA6	4.235036e-24	1.0	0.277224	4.235036e-24	True
PSMD11	4.601922e-04	1.0	0.064925	4.601922e-04	True
PSMB1	1.976143e-29	1.0	0.278219	1.976143e-29	True
PSMD3	2.011238e-12	1.0	0.141735	2.011238e-12	True
PSMB6	5.243052e-27	-1.0	-0.716784	5.243052e-27	True
PSMA5	2.114399e-35	1.0	0.302883	2.114399e-35	True
PSMB4	1.921437e-21	1.0	0.255068	1.921437e-21	True
PSMC2	4.943515e-10	1.0	0.109437	4.943515e-10	True
PSME3	8.587448e-06	-1.0	-0.129204	8.587448e-06	True
PSMD6	9.883283e-13	1.0	0.130206	9.883283e-13	True
PSMC3	3.781102e-15	1.0	0.139832	3.781102e-15	True
PSMD1	1.100341e-05	1.0	0.082433	1.100341e-05	True
PSMD1	2.889130e-08	1.0	0.509608	2.889130e-08	True
PSMD2	1.523451e-03	1.0	0.060321	1.523451e-03	True
PSMC5	3.164878e-16	1.0	0.155371	3.164878e-16	True
PSMD12	1.703326e-37	1.0	0.334370	1.703326e-37	True
PSMB10	4.819896e-58	1.0	1.266577	4.819896e-58	True
PSMB5	2.031051e-40	-1.0	-1.015462	2.031051e-40	True
PSMD4	2.790907e-03	-1.0	-0.090527	2.790907e-03	True
PSMD4	3.220494e-09	1.0	0.118229	3.220494e-09	True
PSMA7	1.114386e-25	1.0	0.235643	1.114386e-25	True
PSMB2	2.936813e-20	1.0	0.225433	2.936813e-20	True
PSMB9	1.075926e-75	1.0	1.145988	1.075926e-75	True
PSMB8	1.082528e-66	1.0	0.877224	1.082528e-66	True
PSMA1	1.743304e-25	1.0	0.258769	1.743304e-25	True
PSME4	1.536922e-02	1.0	0.086871	1.536922e-02	True
PSMD14	1.004902e-11	1.0	0.125789	1.004902e-11	True
PSMC6	8.132824e-13	1.0	0.118492	8.132824e-13	True
PSMD13	9.849506e-10	1.0	0.127654	9.849506e-10	True
PSMD9	2.760906e-02	1.0	0.065384	2.760906e-02	True
PSMB3	1.100460e-03	1.0	0.092246	1.100460e-03	True

Step 4: Display data with Reactome¶

Within utils from the cptac package we can use a function called "reactome_pathway_overlay" to query reactome and overlay the proteosome data we have found to be significant. A detailed guide on using this function can be read by calling help(utils.reactome_pathway_overlay). In this example I have chosen to overlay my data with the R-HSA-1236978 pathway ID. Determining desired pathway ID's in relation to your specific data can be done in a variety of methods such as gene enrichment analysis. Finally, with the quality parameter set at 10 you can zoom in on areas of the pathway to see further detail.

In [8]:

expression_vals, image_path = utils.reactome_pathway_overlay(
    'R-HSA-1236978', 
    df=dfObj['simplified_change'], 
    analysis_token=None, 
    open_browser=False,
    export_path="test.png", 
    image_format='png', 
    display_col_idx=None,                      
    diagram_colors='Modern', 
    overlay_colors='Standard', 
    quality=10)
Image(image_path)

Out[8]: