Tutorial 3: Joining dataframes with cptac¶

In this tutorial, we provide several examples of how to use the built-in cptac functions for joining different dataframes.

We will use data for endometrial carcinoma (UCEC). First we need to import the package and create an endometrial data object, which we call 'en'.

In [1]:
# Start by importing the cptac package
import cptac

# Create an endometrial data object, named 'en'
en = cptac.Ucec()

# List the available data sources
en.list_data_sources()
Out[1]:
    Data type            Available sources
0   CNV                  [bcm, washu]
1   circular_RNA         [bcm]
2   miRNA                [bcm, washu]
3   proteomics           [bcm, umich]
4   transcriptomics      [bcm, broad, washu]
5   ancestry_prediction  [harmonized]
6   somatic_mutation     [harmonized, washu]
7   clinical             [mssm]
8   follow-up            [mssm]
9   medical_history      [mssm]
10  acetylproteomics     [umich]
11  phosphoproteomics    [umich]
12  cibersort            [washu]
13  hla_typing           [washu]
14  tumor_purity         [washu]
15  xcell                [washu]

en.list_data_sources() shows the types of data available in the dataset and their respective sources. For example, proteomics data is available from bcm and umich, transcriptomics data from bcm, broad, and washu, and so forth.

In [2]:
# Retrieve the transcriptomics data from bcm
bcm_data = en.get_transcriptomics('bcm')

# Display the first few rows of the dataframe
bcm_data.head()
Out[2]:
Name A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2ML1-AS1 A2ML1-AS2 A2MP1 A3GALT2 ... ZXDB ZXDC ZYG11A ZYG11AP1 ZYG11B ZYX ZYXP1 ZZEF1 hsa-mir-1253 hsa-mir-423
Database_ID ENSG00000121410.12 ENSG00000268895.6 ENSG00000148584.15 ENSG00000175899.15 ENSG00000245105.4 ENSG00000166535.20 ENSG00000256661.1 ENSG00000256904.1 ENSG00000256069.7 ENSG00000184389.9 ... ENSG00000198455.4 ENSG00000070476.15 ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13 ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15 ENSG00000272920.1 ENSG00000266919.3
Patient_ID
C3L-00006 2.54 5.11 3.60 13.75 6.45 7.08 1.80 0.00 2.60 1.16 ... 10.17 10.61 5.54 0.0 11.85 10.60 0.0 11.87 0.0 0.0
C3L-00008 4.40 4.63 5.49 13.89 6.61 6.97 0.00 2.74 3.25 0.00 ... 9.79 10.48 7.79 0.0 12.28 11.28 0.0 11.93 0.0 0.0
C3L-00032 4.83 7.26 3.73 14.48 6.91 9.56 0.98 0.00 3.26 0.00 ... 9.43 9.97 6.48 0.0 11.72 10.37 0.0 11.70 0.0 0.0
C3L-00084 4.73 6.01 5.37 15.17 7.93 3.86 0.00 0.00 3.73 1.15 ... 9.23 10.37 7.47 0.0 11.86 10.13 0.0 11.19 0.0 0.0
C3L-00090 4.14 6.24 5.69 13.87 6.79 4.32 0.00 0.00 3.23 0.00 ... 9.69 9.64 7.60 0.0 11.98 10.31 0.0 11.45 0.0 0.0

5 rows × 59286 columns

In the above code, get_transcriptomics('bcm') is used to retrieve the transcriptomics data from bcm. Each row represents a different patient, and each column corresponds to a different gene.

General format¶

cptac has a helpful function called multi_join. It allows data from several different cptac dataframes to be joined at the same time.

To use multi_join, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.

Whenever a column from an -omics dataframe is included in a joined table, the source and name of the -omics dataframe it came from are appended to the column header (for example, ARF5_umich_proteomics), to avoid confusion.

If you wish to only include particular columns in the join, include them as values in the dictionary. All values will accept either a single column name as a string, or a list of column name strings. In this use case, we will usually only select specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.

The join functions use logic analogous to an SQL INNER JOIN.
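To make the column naming and the join behaviour concrete, here is a minimal pandas sketch using two made-up tables indexed by Patient_ID. It is not cptac's implementation: the values, the helper name tag_columns, and the use of how="inner" (chosen to follow the statement above) are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for two -omics tables, indexed by Patient_ID (values are made up)
prot = pd.DataFrame(
    {"ARF5": [-0.06, 0.55, 0.09]},
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
)
tran = pd.DataFrame(
    {"ARF5": [9.1, 8.7], "A1BG": [2.54, 4.40]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)


def tag_columns(df, source, datatype):
    """Append '_<source>_<datatype>' to every column name (hypothetical helper)."""
    return df.add_suffix(f"_{source}_{datatype}")


# Tagging first keeps the two ARF5 columns distinguishable after the join
joined = tag_columns(prot, "umich", "proteomics").join(
    tag_columns(tran, "bcm", "transcriptomics"),
    how="inner",  # keep only patients present in both tables
)
print(joined)
```

In this toy case the patient C3L-00032 is dropped because it has no row in the second table.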

Join dictionary¶

The main parameter for the multi_join function is a dictionary with source and datatype as a key, and specific columns as a value. Because there are multiple sources for each datatype, the desired source needs to be included. This can be done in two different ways. The first is by using a string that contains the source, a space, and then the datatype. The second is by using a tuple formatted (source, datatype). For example, using:

{('umich', 'proteomics'): ''}

or

{"umich proteomics": ''}

as the join dictionary would each result in multi_join returning a dataframe containing only umich proteomics data.

You'll notice the value in the key:value pair is an empty string. Because a dictionary needs to have a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples.
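The two key styles and the different value forms described above can be reduced to one canonical shape. The sketch below shows one way to do that; normalize_join_dict is a hypothetical helper written for this illustration and is not part of cptac.

```python
def normalize_join_dict(join_dict):
    """Return {(source, datatype): list_of_columns}; an empty list means 'all columns'.

    Hypothetical helper for illustration only; not part of cptac.
    """
    normalized = {}
    for key, value in join_dict.items():
        if isinstance(key, str):
            # "umich proteomics" -> ("umich", "proteomics")
            source, datatype = key.split(" ", 1)
        else:
            source, datatype = key
        if isinstance(value, str):
            # '' selects everything; any other string is a single column name
            columns = [value] if value else []
        else:
            columns = list(value)
        normalized[(source, datatype)] = columns
    return normalized


a = normalize_join_dict({"umich proteomics": "", "bcm transcriptomics": ["ZYX", "ZZEF1"]})
b = normalize_join_dict({("umich", "proteomics"): [], ("bcm", "transcriptomics"): ["ZYX", "ZZEF1"]})
print(a == b)
```

Both dictionaries normalize to the same result, which mirrors the statement that the string and tuple key forms are interchangeable.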

Join omics to omics¶

multi_join can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics.

In [3]:
# Joining two -omics dataframes together using multi_join
prot_and_tran = en.multi_join({"umich proteomics":'', "bcm transcriptomics":''})
prot_and_tran.head()
Out[3]:
Name ARF5_umich_proteomics M6PR_umich_proteomics ESRRA_umich_proteomics FKBP4_umich_proteomics NDUFAF7_umich_proteomics FUCA2_umich_proteomics DBNDD1_umich_proteomics SEMA3F_umich_proteomics CFTR_umich_proteomics CYP51A1_umich_proteomics ... ZXDB_bcm_transcriptomics ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics hsa-mir-423_bcm_transcriptomics
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000002501.6 ENSP00000002829.3 ENSP00000003084.6 ENSP00000003100.8 ... ENSG00000198455.4 ENSG00000070476.15 ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13 ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15 ENSG00000272920.1 ENSG00000266919.3
Patient_ID
C3L-00006 -0.056513 0.016557 0.002569 0.389819 0.603610 -0.332543 -0.790426 NaN 0.822732 0.039134 ... 10.17 10.61 5.54 0.0 11.85 10.60 0.0 11.87 0.0 0.0
C3L-00008 0.549959 -0.206129 0.905784 -0.303631 0.018767 0.503513 0.950955 0.080142 NaN -0.063213 ... 9.79 10.48 7.79 0.0 12.28 11.28 0.0 11.93 0.0 0.0
C3L-00032 0.088681 -0.154447 -0.190515 0.170753 0.196356 0.544194 -0.179078 NaN NaN 0.377405 ... 9.43 9.97 6.48 0.0 11.72 10.37 0.0 11.70 0.0 0.0
C3L-00084 -0.846555 0.027740 NaN 0.178700 0.264054 -0.183548 0.077215 -0.247164 0.152277 -0.279549 ... 9.23 10.37 7.47 0.0 11.86 10.13 0.0 11.19 0.0 0.0
C3L-00090 0.539019 0.956619 -0.039516 0.323656 0.064605 0.173433 -0.524325 -0.038590 -0.311486 0.309905 ... 9.69 9.64 7.60 0.0 11.98 10.31 0.0 11.45 0.0 0.0

5 rows × 71948 columns

In this example, multi_join is used to join proteomics data from umich and transcriptomics data from bcm into one combined dataframe.

In [4]:
# Using multi_join with specified columns
prot_and_tran_selected = en.multi_join({"umich proteomics":'ARF5', "bcm transcriptomics":'A1BG'})
prot_and_tran_selected.head()
Out[4]:
Name ARF5_umich_proteomics A1BG_bcm_transcriptomics
Database_ID ENSP00000000233.5 ENSG00000121410.12
Patient_ID
C3L-00006 -0.056513 2.54
C3L-00008 0.549959 4.40
C3L-00032 0.088681 4.83
C3L-00084 -0.846555 4.73
C3L-00090 0.539019 4.14

Here, multi_join is used again, but this time only the 'ARF5' column from the proteomics data and the 'A1BG' column from the transcriptomics data are included in the resulting dataframe.
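The joined table is displayed with two column levels, Name and Database_ID. If you only need the Name level for downstream work, plain pandas can drop the other one. The sketch below rebuilds a small table by hand, with values copied from Out[4], so that it runs without downloading data; treating the columns as a two-level MultiIndex with these level names is an assumption based on the displayed output.

```python
import pandas as pd

# Hand-built stand-in for the joined table shown in Out[4]
columns = pd.MultiIndex.from_tuples(
    [
        ("ARF5_umich_proteomics", "ENSP00000000233.5"),
        ("A1BG_bcm_transcriptomics", "ENSG00000121410.12"),
    ],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame(
    [[-0.056513, 2.54], [0.549959, 4.40], [0.088681, 4.83]],
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
    columns=columns,
)

# Drop the Database_ID level so columns can be addressed by a single name
flat = toy.droplevel("Database_ID", axis=1)
print(flat["A1BG_bcm_transcriptomics"])
```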

Join metadata to omics¶

The multi_join function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:

In [5]:
# Join a metadata dataframe with an -omics dataframe
clin_and_tran = en.multi_join({"mssm clinical":'', "bcm transcriptomics":''})
clin_and_tran.head()
Out[5]:
Name tumor_code discovery_study type_of_analyzed_samples_mssm_clinical confirmatory_study type_of_analyzed_samples_mssm_clinical age sex race ethnicity ethnicity_race_ancestry_identified ... ZXDB_bcm_transcriptomics ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics hsa-mir-423_bcm_transcriptomics
Database_ID ... ENSG00000198455.4 ENSG00000070476.15 ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13 ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15 ENSG00000272920.1 ENSG00000266919.3
Patient_ID
C3L-00006 UCEC Yes Tumor_and_Normal NaN NaN 64 Female White Not Hispanic or Latino White ... 10.17 10.61 5.54 0.0 11.85 10.60 0.0 11.87 0.0 0.0
C3L-00008 UCEC Yes Tumor NaN NaN 58 Female White Not Hispanic or Latino White ... 9.79 10.48 7.79 0.0 12.28 11.28 0.0 11.93 0.0 0.0
C3L-00032 UCEC Yes Tumor NaN NaN 50 Female White Not Hispanic or Latino White ... 9.43 9.97 6.48 0.0 11.72 10.37 0.0 11.70 0.0 0.0
C3L-00084 UCEC Yes Tumor NaN NaN 74 Female White Not Hispanic or Latino White ... 9.23 10.37 7.47 0.0 11.86 10.13 0.0 11.19 0.0 0.0
C3L-00090 UCEC Yes Tumor NaN NaN 75 Female White Not Hispanic or Latino White ... 9.69 9.64 7.60 0.0 11.98 10.31 0.0 11.45 0.0 0.0

5 rows × 59410 columns

Joining only specific columns:

In [6]:
clin_and_tran = en.multi_join({"mssm clinical": ["age", "Overall survival, days"], "bcm transcriptomics": ["ZYX", 'ZZEF1']})
clin_and_tran.head()
Out[6]:
Name age Overall survival, days ZYX_bcm_transcriptomics ZZEF1_bcm_transcriptomics
Database_ID ENSG00000159840.16 ENSG00000074755.15
Patient_ID
C3L-00006 64 737.0 10.60 11.87
C3L-00008 58 898.0 11.28 11.93
C3L-00032 50 1710.0 10.37 11.70
C3L-00084 74 335.0 10.13 11.19
C3L-00090 75 1281.0 10.31 11.45

Join metadata to metadata¶

Two metadata dataframes (e.g. clinical or derived_molecular) can be joined in the same way, with one dictionary entry per dataframe. Passing a column name selects just that column, while passing an empty string '' or an empty list [] selects the entire dataframe. The cell below passes an empty string for both entries, so every column of the clinical and transcriptomics dataframes is included.

In [7]:
clin_and_tran = en.multi_join({
    "mssm clinical": "",
    "bcm transcriptomics": '' # Note that by using an empty string or list as the value, we join the entire dataframe
})

clin_and_tran.head()
Out[7]:
Name tumor_code discovery_study type_of_analyzed_samples_mssm_clinical confirmatory_study type_of_analyzed_samples_mssm_clinical age sex race ethnicity ethnicity_race_ancestry_identified ... ZXDB_bcm_transcriptomics ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics hsa-mir-423_bcm_transcriptomics
Database_ID ... ENSG00000198455.4 ENSG00000070476.15 ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13 ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15 ENSG00000272920.1 ENSG00000266919.3
Patient_ID
C3L-00006 UCEC Yes Tumor_and_Normal NaN NaN 64 Female White Not Hispanic or Latino White ... 10.17 10.61 5.54 0.0 11.85 10.60 0.0 11.87 0.0 0.0
C3L-00008 UCEC Yes Tumor NaN NaN 58 Female White Not Hispanic or Latino White ... 9.79 10.48 7.79 0.0 12.28 11.28 0.0 11.93 0.0 0.0
C3L-00032 UCEC Yes Tumor NaN NaN 50 Female White Not Hispanic or Latino White ... 9.43 9.97 6.48 0.0 11.72 10.37 0.0 11.70 0.0 0.0
C3L-00084 UCEC Yes Tumor NaN NaN 74 Female White Not Hispanic or Latino White ... 9.23 10.37 7.47 0.0 11.86 10.13 0.0 11.19 0.0 0.0
C3L-00090 UCEC Yes Tumor NaN NaN 75 Female White Not Hispanic or Latino White ... 9.69 9.64 7.60 0.0 11.98 10.31 0.0 11.45 0.0 0.0

5 rows × 59410 columns

Join many datatypes together¶

If you need data from three or more dataframes, they can all simply be added to the joining dictionary. multi_join accepts any number of entries in the joining dictionary.

In [8]:
joining_dictionary = {"umich proteomics": "ARF5", "bcm transcriptomics": "A1BG", "mssm clinical": [], "washu somatic_mutation": []}
en.multi_join(joining_dictionary).head()
Out[8]:
Name ARF5_umich_proteomics A1BG_bcm_transcriptomics tumor_code discovery_study type_of_analyzed_samples_mssm_clinical confirmatory_study type_of_analyzed_samples_mssm_clinical age sex race ... additional_treatment_immuno_for_new_tumor number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis Recurrence-free survival, days Recurrence-free survival from collection, days Recurrence status (1, yes; 0, no) Overall survival, days Overall survival from collection, days Survival status (1, dead; 0, alive) Sample_Status
Database_ID ENSP00000000233.5 ENSG00000121410.12 ...
Patient_ID
C3L-00006 -0.056513 2.54 UCEC Yes Tumor_and_Normal NaN NaN 64 Female White ... NaN NaN NaN NaN NaN 0.0 737.0 737.0 0.0 Tumor
C3L-00008 0.549959 4.40 UCEC Yes Tumor NaN NaN 58 Female White ... NaN NaN NaN NaN NaN 0.0 898.0 898.0 0.0 Tumor
C3L-00032 0.088681 4.83 UCEC Yes Tumor NaN NaN 50 Female White ... NaN NaN NaN NaN NaN 0.0 1710.0 1710.0 0.0 Tumor
C3L-00084 -0.846555 4.73 UCEC Yes Tumor NaN NaN 74 Female White ... NaN NaN NaN NaN NaN 0.0 335.0 335.0 0.0 Tumor
C3L-00090 0.539019 4.14 UCEC Yes Tumor NaN NaN 75 Female White ... No NaN NaN 50.0 56.0 1.0 1281.0 1287.0 1.0 Tumor

5 rows × 127 columns

multi_join does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well.

In [9]:
sample_type_and_discovery = en.multi_join({"mssm clinical": ['type_of_analyzed_samples', 'discovery_study']})
sample_type_and_discovery.head()
Out[9]:
Name type_of_analyzed_samples_mssm_clinical type_of_analyzed_samples_mssm_clinical discovery_study
Patient_ID
C3L-00006 Tumor_and_Normal NaN Yes
C3L-00008 Tumor NaN Yes
C3L-00032 Tumor NaN Yes
C3L-00084 Tumor NaN Yes
C3L-00090 Tumor NaN Yes

Join omics to mutations¶

An -omics dataframe can be joined with the mutation data for a specified gene or genes using join_omics_to_mutations. Because there might be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation.

For samples with no mutation for a particular gene, the list will contain either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether the sample is a tumor or normal one. The mutation status column will contain either "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal", which aids with parsing.

Let's consider an example:

In [10]:
somatic_mutations = en.get_somatic_mutation('harmonized')
selected_prot_and_som_mut = en.join_omics_to_mutations(
    omics_name = "proteomics",
    mutations_genes = "SHANK2",
    omics_genes = ["ARF5", "M6PR"],
    omics_source = 'umich',
    mutations_source = 'harmonized')
selected_prot_and_som_mut.head(10)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 325)
Out[10]:
Name ARF5_umich_proteomics M6PR_umich_proteomics SHANK2_Mutation SHANK2_Location SHANK2_Mutation_Status Sample_Status
Patient_ID
C3L-00006 -0.056513 0.016557 [Missense_Mutation] [p.S1692R] Single_mutation Tumor
C3L-00008 0.549959 -0.206129 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00032 0.088681 -0.154447 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00084 -0.846555 0.027740 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00090 0.539019 0.956619 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00098 -0.017370 0.125574 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00136 0.230347 0.575436 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00137 0.191915 0.113577 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00139 -0.410142 0.381355 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00143 -0.170514 1.008577 [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor

In the code above, we're joining proteomics data and somatic mutation data. The gene for the mutation data is "SHANK2" and the genes for the proteomics data are "ARF5" and "M6PR".
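Because the mutation and location columns hold lists, a little pandas is usually needed before analysis. The sketch below works on a hand-built table shaped like the join output (the PTEN values are illustrative and there is no cptac call), and shows two common steps: filtering on the string-valued status column, and expanding the lists to one row per mutation. Passing several columns to explode assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hand-built rows in the same shape as the mutation join output
toy = pd.DataFrame(
    {
        "PTEN_Mutation": [
            ["Missense_Mutation", "Nonsense_Mutation"],
            ["Missense_Mutation"],
            ["Wildtype_Tumor"],
        ],
        "PTEN_Location": [["p.R130Q", "p.R233*"], ["p.G127R"], ["No_mutation"]],
        "PTEN_Mutation_Status": ["Multiple_mutation", "Single_mutation", "Wildtype_Tumor"],
    },
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00084"], name="Patient_ID"),
)

# The status column is a plain string, so it is the easiest one to filter on
mutated = toy[toy["PTEN_Mutation_Status"].isin(["Single_mutation", "Multiple_mutation"])]

# explode() yields one row per (patient, mutation) pair; the two list columns
# must have matching lengths in every row
long_form = toy.explode(["PTEN_Mutation", "PTEN_Location"])
print(long_form)
```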

Filtering multiple mutations¶

If there are multiple mutations, you can use the multi_join function to filter them. The function allows you to specify certain mutation types or locations to prioritize, and it provides a default sorting hierarchy for all other mutations.

Here are some examples:

In [11]:
SHANK2_default_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                     "harmonized somatic_mutation": "SHANK2"},
                                    mutations_filter=[])

SHANK2_simple_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                    "harmonized somatic_mutation": "SHANK2"},
                                   mutations_filter=["Missense_Mutation"])

SHANK2_complex_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                       "harmonized somatic_mutation": "SHANK2"},
                                      mutations_filter=["p.R130Q", "Nonsense_Mutation"])
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 1)
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 5)
cptac warning: Filter value p.R130Q does not exist in the mutations data for the SHANK2 gene, though it exists for other genes. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 9)

The mutations_filter parameter allows you to specify the mutation types or locations you want to prioritize. If you don't provide any specific mutations (i.e., you pass an empty list), it will use a default hierarchy, choosing truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence. In the third call above, p.R130Q is not a SHANK2 mutation location, so cptac warns that the filter value does not exist for that gene.
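The selection logic described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not cptac's code: the helper names, the grouping of mutation types into truncation, missense, and silent sets, and the rule that earlier filter values outrank later ones are all assumptions. Unknown types are ranked last, in line with the warning shown above.

```python
import re

# Assumed grouping of mutation types; the exact lists cptac uses may differ
TRUNCATION = {"Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins"}
MISSENSE = {"Missense_Mutation"}
SILENT = {"Silent"}


def default_rank(mutation_type):
    """Lower rank means higher priority; unknown types get the lowest priority."""
    if mutation_type in TRUNCATION:
        return 0
    if mutation_type in MISSENSE:
        return 1
    if mutation_type in SILENT:
        return 2
    return 3


def position(location):
    """Extract the residue number from a location such as 'p.S1692R'."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")


def pick_mutation(mutations, locations, mutations_filter=()):
    """Pick one (type, location) pair from two parallel lists (hypothetical helper)."""
    def sort_key(pair):
        mutation_type, location = pair
        # Index of the first filter value this mutation matches, if any
        filter_rank = len(mutations_filter)
        for i, wanted in enumerate(mutations_filter):
            if wanted in (mutation_type, location):
                filter_rank = i
                break
        return (filter_rank, default_rank(mutation_type), position(location))

    return min(zip(mutations, locations), key=sort_key)


print(pick_mutation(["Missense_Mutation", "Nonsense_Mutation"],
                    ["p.R130Q", "p.R233*"],
                    mutations_filter=["Missense_Mutation"]))
```

With no filter, the same sample would resolve to the nonsense mutation, since truncation mutations outrank missense mutations in the default hierarchy.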

Join metadata to mutations¶

Joining metadata to mutation data follows the same process as joining other datatypes. You can also use the mutations_filter parameter to filter multiple mutations.

For instance, you can use the get_clinical function to retrieve clinical data, as shown below:

In [12]:
en.get_clinical('mssm')
Out[12]:
Name tumor_code discovery_study type_of_analyzed_samples confirmatory_study type_of_analyzed_samples age sex race ethnicity ethnicity_race_ancestry_identified ... additional_treatment_pharmaceutical_therapy_for_new_tumor additional_treatment_immuno_for_new_tumor number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis Recurrence-free survival, days Recurrence-free survival from collection, days Recurrence status (1, yes; 0, no) Overall survival, days Overall survival from collection, days Survival status (1, dead; 0, alive)
Patient_ID
C3L-00006 UCEC Yes Tumor_and_Normal NaN NaN 64 Female White Not Hispanic or Latino White ... NaN NaN NaN NaN NaN NaN 0 737.0 737.0 0.0
C3L-00008 UCEC Yes Tumor NaN NaN 58 Female White Not Hispanic or Latino White ... NaN NaN NaN NaN NaN NaN 0 898.0 898.0 0.0
C3L-00032 UCEC Yes Tumor NaN NaN 50 Female White Not Hispanic or Latino White ... NaN NaN NaN NaN NaN NaN 0 1710.0 1710.0 0.0
C3L-00084 UCEC Yes Tumor NaN NaN 74 Female White Not Hispanic or Latino White ... NaN NaN NaN NaN NaN NaN 0 335.0 335.0 0.0
C3L-00090 UCEC Yes Tumor NaN NaN 75 Female White Not Hispanic or Latino White ... Yes No NaN NaN 50.0 56.0 1 1281.0 1287.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
C3N-01520 UCEC Yes Tumor NaN NaN 69 Female Unknown Unknown Slavonic ... NaN NaN NaN NaN NaN NaN 0 287.0 278.0 1.0
C3N-01521 UCEC Yes Tumor NaN NaN 75 Female Unknown Unknown Slavonic ... NaN NaN NaN NaN NaN NaN 0 728.0 681.0 0.0
C3N-01537 UCEC Yes Tumor NaN NaN 74 Female Unknown Unknown Slavonic ... Yes No 62.0 NaN 58.0 31.0 1 698.0 671.0 0.0
C3N-01802 UCEC Yes Tumor NaN NaN 85 Female Black or African American Not Hispanic or Latino American ... No No NaN NaN 598.0 563.0 1 775.0 740.0 0.0
C3N-01825 UCEC Yes Tumor NaN NaN 70 Female Unknown Unknown Slavonic ... NaN NaN NaN NaN NaN NaN 0 687.0 661.0 0.0

103 rows × 124 columns

In [13]:
en.join_metadata_to_mutations(
    metadata_name="clinical",
    metadata_source="mssm",
    metadata_cols=["age", "sex", "race"],
    mutations_source="harmonized",
    mutations_genes="SHANK2",
    mutations_filter=["Missense_Mutation"])
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 437)
Out[13]:
Name age sex race SHANK2_Mutation SHANK2_Location SHANK2_Mutation_Status Sample_Status
Patient_ID
C3L-00006 64 Female White Missense_Mutation p.S1692R Single_mutation Tumor
C3L-00008 58 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00032 50 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00084 74 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00090 75 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
... ... ... ... ... ... ... ...
C3N-01520 69 Female Unknown Missense_Mutation p.P1586S Single_mutation Tumor
C3N-01521 75 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3N-01537 74 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3N-01802 85 Female Black or African American Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3N-01825 70 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Tumor

103 rows × 7 columns

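This command joins the age, sex, and race metadata with the mutation data for the SHANK2 gene. Because mutations_filter is given, each sample is reduced to a single SHANK2 mutation (note that the values are no longer lists), with Missense_Mutation prioritized when a sample has more than one.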

If you need to join metadata to a larger number of mutation genes, the multi_join function can be useful. Below, we join the same metadata with the mutation data for the SHANK2, PTEN, and TP53 genes. Here we do not filter mutations. Remember that when mutations_filter is omitted, multi_join behaves like join_metadata_to_mutations without a filter: it returns all mutations as lists in the output dataframe, regardless of the number of mutations for a given sample.

In [14]:
en.multi_join({"mssm clinical": ["age", "sex", "race"],
               "harmonized somatic_mutation": ["SHANK2", "PTEN", "TP53"]})
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3189298179.py, line 1)
Out[14]:
Name age sex race SHANK2_Mutation SHANK2_Location SHANK2_Mutation_Status PTEN_Mutation PTEN_Location PTEN_Mutation_Status TP53_Mutation TP53_Location TP53_Mutation_Status Sample_Status
Patient_ID
C3L-00006 64 Female White [Missense_Mutation] [p.S1692R] Single_mutation [Missense_Mutation, Nonsense_Mutation] [p.R130Q, p.R233*] Multiple_mutation [Missense_Mutation] [p.R248W] Single_mutation Tumor
C3L-00008 58 Female White [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Missense_Mutation] [p.G127R] Single_mutation [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00032 50 Female White [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Nonsense_Mutation] [p.W111*] Single_mutation [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00084 74 Female White [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3L-00090 75 Female White [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Missense_Mutation] [p.R130G] Single_mutation [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
... ... ... ... ... ... ... ... ... ... ... ... ... ...
C3N-01520 69 Female Unknown [Missense_Mutation] [p.P1586S] Single_mutation [Frame_Shift_Del, Frame_Shift_Ins] [p.N323fs, p.D268fs] Multiple_mutation [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3N-01521 75 Female Unknown [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Missense_Mutation] [p.H193L] Single_mutation Tumor
C3N-01537 74 Female Unknown [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor Tumor
C3N-01802 85 Female Black or African American [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Missense_Mutation] [p.P27S] Single_mutation Tumor
C3N-01825 70 Female Unknown [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Wildtype_Tumor] [No_mutation] Wildtype_Tumor [Missense_Mutation] [p.R175H] Single_mutation Tumor

103 rows × 13 columns

Here is an example of joining clinical data with mutations while filtering specific mutations:

In [15]:
survival_and_SHANK2 = en.multi_join({"mssm clinical": ["age", "sex", "race"],
               "harmonized somatic_mutation": ["SHANK2", "PTEN", "TP53"]}, 
               mutations_filter=["Missense_Mutation"])

survival_and_SHANK2
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3101478147.py, line 1)
Out[15]:
Name age sex race SHANK2_Mutation SHANK2_Location SHANK2_Mutation_Status PTEN_Mutation PTEN_Location PTEN_Mutation_Status TP53_Mutation TP53_Location TP53_Mutation_Status Sample_Status
Patient_ID
C3L-00006 64 Female White Missense_Mutation p.S1692R Single_mutation Missense_Mutation p.R130Q Multiple_mutation Missense_Mutation p.R248W Single_mutation Tumor
C3L-00008 58 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Missense_Mutation p.G127R Single_mutation Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00032 50 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Nonsense_Mutation p.W111* Single_mutation Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00084 74 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3L-00090 75 Female White Wildtype_Tumor No_mutation Wildtype_Tumor Missense_Mutation p.R130G Single_mutation Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
... ... ... ... ... ... ... ... ... ... ... ... ... ...
C3N-01520 69 Female Unknown Missense_Mutation p.P1586S Single_mutation Frame_Shift_Ins p.D268fs Multiple_mutation Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3N-01521 75 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Missense_Mutation p.H193L Single_mutation Tumor
C3N-01537 74 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Tumor
C3N-01802 85 Female Black or African American Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Missense_Mutation p.P27S Single_mutation Tumor
C3N-01825 70 Female Unknown Wildtype_Tumor No_mutation Wildtype_Tumor Wildtype_Tumor No_mutation Wildtype_Tumor Missense_Mutation p.R175H Single_mutation Tumor

103 rows × 13 columns

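Remember that the mutations_filter parameter receives a list. In this example, it prioritizes the "Missense_Mutation" type for all genes specified. Samples with no missense mutation in a gene fall back to the default hierarchy, which is why Nonsense_Mutation and Frame_Shift_Ins still appear in the PTEN columns.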

Exporting dataframes¶

If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:

In [16]:
survival_and_SHANK2.to_csv(path_or_buf="histologic_type_and_PTEN_mutation.tsv", sep='\t')
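To read such a file back, pandas' read_csv with the same separator works. The sketch below does a round trip on a small hand-built table written to a temporary directory, so it does not depend on cptac or on the file above; the table contents and the file name toy_join.tsv are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for a filtered join result (values are illustrative)
toy = pd.DataFrame(
    {"age": [64, 58], "SHANK2_Mutation": ["Missense_Mutation", "Wildtype_Tumor"]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_join.tsv")
    toy.to_csv(path_or_buf=path, sep="\t")
    # index_col=0 restores Patient_ID as the index instead of a regular column
    reloaded = pd.read_csv(path, sep="\t", index_col=0)

print(reloaded)
```

Note that list-valued columns from an unfiltered mutation join are written as their string form, so they come back as strings rather than lists.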