Tutorial 3: Joining dataframes with cptac¶
In this tutorial, we provide several examples of how to use the built-in cptac functions for joining different dataframes.
We will use data for endometrial carcinoma. First, we import the package and create an endometrial data object, which we call 'en'.
# Start by importing the cptac package
import cptac
# Create an endometrial data object, named 'en'
en = cptac.Ucec()
# List the available data sources
en.list_data_sources()
Data type | Available sources | |
---|---|---|
0 | CNV | [bcm, washu] |
1 | circular_RNA | [bcm] |
2 | miRNA | [bcm, washu] |
3 | proteomics | [bcm, umich] |
4 | transcriptomics | [bcm, broad, washu] |
5 | ancestry_prediction | [harmonized] |
6 | somatic_mutation | [harmonized, washu] |
7 | clinical | [mssm] |
8 | follow-up | [mssm] |
9 | medical_history | [mssm] |
10 | acetylproteomics | [umich] |
11 | phosphoproteomics | [umich] |
12 | cibersort | [washu] |
13 | hla_typing | [washu] |
14 | tumor_purity | [washu] |
15 | xcell | [washu] |
en.list_data_sources() shows the types of data available in the dataset and their respective sources. For example, proteomics data is available from bcm and umich, transcriptomics data from bcm, broad, and washu, and so forth.
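If you want to query this table programmatically, ordinary pandas filtering is enough. The sketch below assumes the returned object is a DataFrame with the columns "Data type" and "Available sources" and list-valued cells, as the rendered output suggests. It uses a hand-copied subset of the table rather than a live cptac call, and the helper datatypes_from is hypothetical.

```python
import pandas as pd

# Hand-copied subset of the table shown above (not fetched from cptac).
sources = pd.DataFrame({
    "Data type": ["CNV", "proteomics", "transcriptomics", "clinical"],
    "Available sources": [
        ["bcm", "washu"],
        ["bcm", "umich"],
        ["bcm", "broad", "washu"],
        ["mssm"],
    ],
})


def datatypes_from(source_table, source):
    """Return the data types that list `source` among their available sources.

    Hypothetical helper for illustration; it is not part of cptac.
    """
    mask = source_table["Available sources"].apply(lambda s: source in s)
    return source_table.loc[mask, "Data type"].tolist()


bcm_types = datatypes_from(sources, "bcm")
print(bcm_types)
```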
# Retrieve the transcriptomics data from bcm
bcm_data = en.get_transcriptomics('bcm')
# Display the first few rows of the dataframe
bcm_data.head()
Name | A1BG | A1BG-AS1 | A1CF | A2M | A2M-AS1 | A2ML1 | A2ML1-AS1 | A2ML1-AS2 | A2MP1 | A3GALT2 | ... | ZXDB | ZXDC | ZYG11A | ZYG11AP1 | ZYG11B | ZYX | ZYXP1 | ZZEF1 | hsa-mir-1253 | hsa-mir-423 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSG00000121410.12 | ENSG00000268895.6 | ENSG00000148584.15 | ENSG00000175899.15 | ENSG00000245105.4 | ENSG00000166535.20 | ENSG00000256661.1 | ENSG00000256904.1 | ENSG00000256069.7 | ENSG00000184389.9 | ... | ENSG00000198455.4 | ENSG00000070476.15 | ENSG00000203995.10 | ENSG00000232242.2 | ENSG00000162378.13 | ENSG00000159840.16 | ENSG00000274572.1 | ENSG00000074755.15 | ENSG00000272920.1 | ENSG00000266919.3 |
Patient_ID | |||||||||||||||||||||
C3L-00006 | 2.54 | 5.11 | 3.60 | 13.75 | 6.45 | 7.08 | 1.80 | 0.00 | 2.60 | 1.16 | ... | 10.17 | 10.61 | 5.54 | 0.0 | 11.85 | 10.60 | 0.0 | 11.87 | 0.0 | 0.0 |
C3L-00008 | 4.40 | 4.63 | 5.49 | 13.89 | 6.61 | 6.97 | 0.00 | 2.74 | 3.25 | 0.00 | ... | 9.79 | 10.48 | 7.79 | 0.0 | 12.28 | 11.28 | 0.0 | 11.93 | 0.0 | 0.0 |
C3L-00032 | 4.83 | 7.26 | 3.73 | 14.48 | 6.91 | 9.56 | 0.98 | 0.00 | 3.26 | 0.00 | ... | 9.43 | 9.97 | 6.48 | 0.0 | 11.72 | 10.37 | 0.0 | 11.70 | 0.0 | 0.0 |
C3L-00084 | 4.73 | 6.01 | 5.37 | 15.17 | 7.93 | 3.86 | 0.00 | 0.00 | 3.73 | 1.15 | ... | 9.23 | 10.37 | 7.47 | 0.0 | 11.86 | 10.13 | 0.0 | 11.19 | 0.0 | 0.0 |
C3L-00090 | 4.14 | 6.24 | 5.69 | 13.87 | 6.79 | 4.32 | 0.00 | 0.00 | 3.23 | 0.00 | ... | 9.69 | 9.64 | 7.60 | 0.0 | 11.98 | 10.31 | 0.0 | 11.45 | 0.0 | 0.0 |
5 rows × 59286 columns
In the above code, get_transcriptomics('bcm') is used to retrieve the transcriptomics data from bcm. Each row represents a different patient, and each column corresponds to a different gene.
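The output above has two header rows, Name and Database_ID, which suggests the columns form a two-level pandas MultiIndex. Assuming that is the case, the sketch below shows how to select a gene by name and how to drop the Database_ID level with plain pandas. The small mock dataframe reuses values from the output above and stands in for the real table.

```python
import pandas as pd

# Small stand-in for the transcriptomics table, built from values shown above.
columns = pd.MultiIndex.from_tuples(
    [("A1BG", "ENSG00000121410.12"), ("A2M", "ENSG00000175899.15")],
    names=["Name", "Database_ID"],
)
mock = pd.DataFrame(
    [[2.54, 13.75], [4.40, 13.89]],
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
    columns=columns,
)

# Select one gene by its label on the "Name" level.
a1bg = mock.xs("A1BG", axis=1, level="Name")

# Drop the Database_ID level to get plain gene-name column headers.
flat = mock.copy()
flat.columns = flat.columns.droplevel("Database_ID")
print(flat.columns.tolist())
```

Dropping a level can produce duplicate column names when one gene maps to several database IDs, so check for duplicates before relying on the flattened headers.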
General format¶
cptac has a helpful function called multi_join. It allows data from several different cptac dataframes to be joined at the same time.
To use multi_join, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.
Whenever a column from an -omics dataframe is included in a joined table, the source and name of the -omics dataframe it came from are appended to the column header (for example, ARF5_umich_proteomics), to avoid confusion.
If you wish to only include particular columns in the join, include them as values in the dictionary. All values will accept either a single column name as a string, or a list of column name strings. In this use case, we will usually only select specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.
The join functions use logic analogous to an SQL INNER JOIN.
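The column naming described above can be imitated with plain pandas. This is a minimal sketch, not cptac's implementation: tag_columns is a hypothetical helper, and the two small dataframes are simplified stand-ins that reuse values from outputs elsewhere in this tutorial.

```python
import pandas as pd

patients = pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID")

# Simplified stand-ins for two -omics tables sharing the same patients.
prot = pd.DataFrame({"ARF5": [-0.056513, 0.549959]}, index=patients)
tran = pd.DataFrame({"A1BG": [2.54, 4.40]}, index=patients)


def tag_columns(df, source, datatype):
    """Append '_<source>_<datatype>' to every column name (hypothetical helper)."""
    return df.add_suffix(f"_{source}_{datatype}")


# Join on the shared Patient_ID index after tagging each table's columns.
joined = tag_columns(prot, "umich", "proteomics").join(
    tag_columns(tran, "bcm", "transcriptomics")
)
print(joined.columns.tolist())
```

Tagging before joining keeps columns apart even when the same gene appears in both tables.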
Join dictionary¶
The main parameter for the multi_join function is a dictionary with source and datatype as a key, and specific columns as a value. Because there are multiple sources for each datatype, the desired source needs to be included. This can be done in two different ways. The first is by using a string that contains the source, a space, and then the datatype. The second is by using a tuple formatted (source, datatype). For example, using:
{('umich', 'proteomics'): ''}
or
{"umich proteomics": ''}
as the join dictionary would each result in multi_join returning a dataframe containing only umich proteomics data.
You'll notice the value in the key:value pair is an empty string. Because a dictionary needs to have a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples.
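To make the equivalence of the two key styles and the meaning of empty values concrete, here is a small pure-Python sketch. normalize_join_dict is a hypothetical helper written for this illustration. It is not how cptac parses the dictionary internally, only one way to express the rules described above.

```python
def normalize_join_dict(join_dict):
    """Return {(source, datatype): list_of_columns}.

    Hypothetical helper: an empty list in the result means "all columns".
    """
    normalized = {}
    for key, cols in join_dict.items():
        if isinstance(key, str):
            # "umich proteomics" -> ("umich", "proteomics")
            source, datatype = key.split(" ", 1)
        else:
            source, datatype = key
        if isinstance(cols, str):
            # A single column name becomes a one-item list; '' becomes [].
            cols = [cols] if cols else []
        normalized[(source, datatype)] = list(cols)
    return normalized


a = normalize_join_dict({"umich proteomics": ""})
b = normalize_join_dict({("umich", "proteomics"): []})
c = normalize_join_dict({"bcm transcriptomics": "A1BG", "mssm clinical": ["age", "sex"]})
print(a == b)
print(c)
```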
Join omics to omics¶
multi_join can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics.
# Joining two -omics dataframes together using multi_join
prot_and_tran = en.multi_join({"umich proteomics":'', "bcm transcriptomics":''})
prot_and_tran.head()
cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\Users\sabme\anaconda3\lib\threading.py, line 910)
Name | ARF5_umich_proteomics | M6PR_umich_proteomics | ESRRA_umich_proteomics | FKBP4_umich_proteomics | NDUFAF7_umich_proteomics | FUCA2_umich_proteomics | DBNDD1_umich_proteomics | SEMA3F_umich_proteomics | CFTR_umich_proteomics | CYP51A1_umich_proteomics | ... | ZXDB_bcm_transcriptomics | ZXDC_bcm_transcriptomics | ZYG11A_bcm_transcriptomics | ZYG11AP1_bcm_transcriptomics | ZYG11B_bcm_transcriptomics | ZYX_bcm_transcriptomics | ZYXP1_bcm_transcriptomics | ZZEF1_bcm_transcriptomics | hsa-mir-1253_bcm_transcriptomics | hsa-mir-423_bcm_transcriptomics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSP00000000412.3 | ENSP00000000442.6 | ENSP00000001008.4 | ENSP00000002125.4 | ENSP00000002165.5 | ENSP00000002501.6 | ENSP00000002829.3 | ENSP00000003084.6 | ENSP00000003100.8 | ... | ENSG00000198455.4 | ENSG00000070476.15 | ENSG00000203995.10 | ENSG00000232242.2 | ENSG00000162378.13 | ENSG00000159840.16 | ENSG00000274572.1 | ENSG00000074755.15 | ENSG00000272920.1 | ENSG00000266919.3 |
Patient_ID | |||||||||||||||||||||
C3L-00006 | -0.056513 | 0.016557 | 0.002569 | 0.389819 | 0.603610 | -0.332543 | -0.790426 | NaN | 0.822732 | 0.039134 | ... | 10.17 | 10.61 | 5.54 | 0.0 | 11.85 | 10.60 | 0.0 | 11.87 | 0.0 | 0.0 |
C3L-00008 | 0.549959 | -0.206129 | 0.905784 | -0.303631 | 0.018767 | 0.503513 | 0.950955 | 0.080142 | NaN | -0.063213 | ... | 9.79 | 10.48 | 7.79 | 0.0 | 12.28 | 11.28 | 0.0 | 11.93 | 0.0 | 0.0 |
C3L-00032 | 0.088681 | -0.154447 | -0.190515 | 0.170753 | 0.196356 | 0.544194 | -0.179078 | NaN | NaN | 0.377405 | ... | 9.43 | 9.97 | 6.48 | 0.0 | 11.72 | 10.37 | 0.0 | 11.70 | 0.0 | 0.0 |
C3L-00084 | -0.846555 | 0.027740 | NaN | 0.178700 | 0.264054 | -0.183548 | 0.077215 | -0.247164 | 0.152277 | -0.279549 | ... | 9.23 | 10.37 | 7.47 | 0.0 | 11.86 | 10.13 | 0.0 | 11.19 | 0.0 | 0.0 |
C3L-00090 | 0.539019 | 0.956619 | -0.039516 | 0.323656 | 0.064605 | 0.173433 | -0.524325 | -0.038590 | -0.311486 | 0.309905 | ... | 9.69 | 9.64 | 7.60 | 0.0 | 11.98 | 10.31 | 0.0 | 11.45 | 0.0 | 0.0 |
5 rows × 71948 columns
In this example, multi_join is used to join proteomics data from umich and transcriptomics data from bcm into one combined dataframe.
# Using multi_join with specified columns
prot_and_tran_selected = en.multi_join({"umich proteomics":'ARF5', "bcm transcriptomics":'A1BG'})
prot_and_tran_selected.head()
Name | ARF5_umich_proteomics | A1BG_bcm_transcriptomics |
---|---|---|
Database_ID | ENSP00000000233.5 | ENSG00000121410.12 |
Patient_ID | ||
C3L-00006 | -0.056513 | 2.54 |
C3L-00008 | 0.549959 | 4.40 |
C3L-00032 | 0.088681 | 4.83 |
C3L-00084 | -0.846555 | 4.73 |
C3L-00090 | 0.539019 | 4.14 |
Here, multi_join is used again, but this time only the 'ARF5' column from the proteomics data and the 'A1BG' column from the transcriptomics data are included in the resulting dataframe.
Join metadata to omics¶
The multi_join function can also join a metadata dataframe (e.g. clinical) with an -omics dataframe:
# Join a metadata dataframe with an -omics dataframe
clin_and_tran = en.multi_join({"mssm clinical":'', "bcm transcriptomics":''})
clin_and_tran.head()
Name | tumor_code | discovery_study | type_of_analyzed_samples_mssm_clinical | confirmatory_study | type_of_analyzed_samples_mssm_clinical | age | sex | race | ethnicity | ethnicity_race_ancestry_identified | ... | ZXDB_bcm_transcriptomics | ZXDC_bcm_transcriptomics | ZYG11A_bcm_transcriptomics | ZYG11AP1_bcm_transcriptomics | ZYG11B_bcm_transcriptomics | ZYX_bcm_transcriptomics | ZYXP1_bcm_transcriptomics | ZZEF1_bcm_transcriptomics | hsa-mir-1253_bcm_transcriptomics | hsa-mir-423_bcm_transcriptomics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ... | ENSG00000198455.4 | ENSG00000070476.15 | ENSG00000203995.10 | ENSG00000232242.2 | ENSG00000162378.13 | ENSG00000159840.16 | ENSG00000274572.1 | ENSG00000074755.15 | ENSG00000272920.1 | ENSG00000266919.3 | ||||||||||
Patient_ID | |||||||||||||||||||||
C3L-00006 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | Not Hispanic or Latino | White | ... | 10.17 | 10.61 | 5.54 | 0.0 | 11.85 | 10.60 | 0.0 | 11.87 | 0.0 | 0.0 |
C3L-00008 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | Not Hispanic or Latino | White | ... | 9.79 | 10.48 | 7.79 | 0.0 | 12.28 | 11.28 | 0.0 | 11.93 | 0.0 | 0.0 |
C3L-00032 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | Not Hispanic or Latino | White | ... | 9.43 | 9.97 | 6.48 | 0.0 | 11.72 | 10.37 | 0.0 | 11.70 | 0.0 | 0.0 |
C3L-00084 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | Not Hispanic or Latino | White | ... | 9.23 | 10.37 | 7.47 | 0.0 | 11.86 | 10.13 | 0.0 | 11.19 | 0.0 | 0.0 |
C3L-00090 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | Not Hispanic or Latino | White | ... | 9.69 | 9.64 | 7.60 | 0.0 | 11.98 | 10.31 | 0.0 | 11.45 | 0.0 | 0.0 |
5 rows × 59410 columns
Joining only specific columns:
clin_and_tran = en.multi_join({"mssm clinical": ["age", "Overall survival, days"], "bcm transcriptomics": ["ZYX", 'ZZEF1']})
clin_and_tran.head()
Name | age | Overall survival, days | ZYX_bcm_transcriptomics | ZZEF1_bcm_transcriptomics |
---|---|---|---|---|
Database_ID | ENSG00000159840.16 | ENSG00000074755.15 | ||
Patient_ID | ||||
C3L-00006 | 64 | 737.0 | 10.60 | 11.87 |
C3L-00008 | 58 | 898.0 | 11.28 | 11.93 |
C3L-00032 | 50 | 1710.0 | 10.37 | 11.70 |
C3L-00084 | 74 | 335.0 | 10.13 | 11.19 |
C3L-00090 | 75 | 1281.0 | 10.31 | 11.45 |
Join metadata to metadata¶
Two metadata dataframes can be joined together in the same way, by listing both in the join dictionary. The cell below reuses the clinical and transcriptomics dataframes to show how whole-dataframe selection works: passing an empty string '' or an empty list [] as the value causes the entire dataframe to be selected.
clin_and_tran = en.multi_join({
    "mssm clinical": "",
    "bcm transcriptomics": ''  # Note that by using an empty string or list as the value, we join the entire dataframe
})
clin_and_tran.head()
Name | tumor_code | discovery_study | type_of_analyzed_samples_mssm_clinical | confirmatory_study | type_of_analyzed_samples_mssm_clinical | age | sex | race | ethnicity | ethnicity_race_ancestry_identified | ... | ZXDB_bcm_transcriptomics | ZXDC_bcm_transcriptomics | ZYG11A_bcm_transcriptomics | ZYG11AP1_bcm_transcriptomics | ZYG11B_bcm_transcriptomics | ZYX_bcm_transcriptomics | ZYXP1_bcm_transcriptomics | ZZEF1_bcm_transcriptomics | hsa-mir-1253_bcm_transcriptomics | hsa-mir-423_bcm_transcriptomics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ... | ENSG00000198455.4 | ENSG00000070476.15 | ENSG00000203995.10 | ENSG00000232242.2 | ENSG00000162378.13 | ENSG00000159840.16 | ENSG00000274572.1 | ENSG00000074755.15 | ENSG00000272920.1 | ENSG00000266919.3 | ||||||||||
Patient_ID | |||||||||||||||||||||
C3L-00006 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | Not Hispanic or Latino | White | ... | 10.17 | 10.61 | 5.54 | 0.0 | 11.85 | 10.60 | 0.0 | 11.87 | 0.0 | 0.0 |
C3L-00008 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | Not Hispanic or Latino | White | ... | 9.79 | 10.48 | 7.79 | 0.0 | 12.28 | 11.28 | 0.0 | 11.93 | 0.0 | 0.0 |
C3L-00032 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | Not Hispanic or Latino | White | ... | 9.43 | 9.97 | 6.48 | 0.0 | 11.72 | 10.37 | 0.0 | 11.70 | 0.0 | 0.0 |
C3L-00084 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | Not Hispanic or Latino | White | ... | 9.23 | 10.37 | 7.47 | 0.0 | 11.86 | 10.13 | 0.0 | 11.19 | 0.0 | 0.0 |
C3L-00090 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | Not Hispanic or Latino | White | ... | 9.69 | 9.64 | 7.60 | 0.0 | 11.98 | 10.31 | 0.0 | 11.45 | 0.0 | 0.0 |
5 rows × 59410 columns
Join many datatypes together¶
If you need data from three or more dataframes, they can all simply be added to the joining dictionary. multi_join does not impose a fixed limit on the number of dataframes the joining dictionary can contain.
joining_dictionary = {"umich proteomics": "ARF5", "bcm transcriptomics": "A1BG", "mssm clinical": [], "washu somatic_mutation": []}
en.multi_join(joining_dictionary).head()
Name | ARF5_umich_proteomics | A1BG_bcm_transcriptomics | tumor_code | discovery_study | type_of_analyzed_samples_mssm_clinical | confirmatory_study | type_of_analyzed_samples_mssm_clinical | age | sex | race | ... | additional_treatment_immuno_for_new_tumor | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis | Recurrence-free survival, days | Recurrence-free survival from collection, days | Recurrence status (1, yes; 0, no) | Overall survival, days | Overall survival from collection, days | Survival status (1, dead; 0, alive) | Sample_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSG00000121410.12 | ... | ||||||||||||||||||
Patient_ID | |||||||||||||||||||||
C3L-00006 | -0.056513 | 2.54 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | ... | NaN | NaN | NaN | NaN | NaN | 0.0 | 737.0 | 737.0 | 0.0 | Tumor |
C3L-00008 | 0.549959 | 4.40 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | ... | NaN | NaN | NaN | NaN | NaN | 0.0 | 898.0 | 898.0 | 0.0 | Tumor |
C3L-00032 | 0.088681 | 4.83 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | ... | NaN | NaN | NaN | NaN | NaN | 0.0 | 1710.0 | 1710.0 | 0.0 | Tumor |
C3L-00084 | -0.846555 | 4.73 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | ... | NaN | NaN | NaN | NaN | NaN | 0.0 | 335.0 | 335.0 | 0.0 | Tumor |
C3L-00090 | 0.539019 | 4.14 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | ... | No | NaN | NaN | 50.0 | 56.0 | 1.0 | 1281.0 | 1287.0 | 1.0 | Tumor |
5 rows × 127 columns
multi_join does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well.
sample_type_and_discovery = en.multi_join({"mssm clinical": ['type_of_analyzed_samples', 'discovery_study']})
sample_type_and_discovery.head()
Name | type_of_analyzed_samples_mssm_clinical | type_of_analyzed_samples_mssm_clinical | discovery_study |
---|---|---|---|
Patient_ID | |||
C3L-00006 | Tumor_and_Normal | NaN | Yes |
C3L-00008 | Tumor | NaN | Yes |
C3L-00032 | Tumor | NaN | Yes |
C3L-00084 | Tumor | NaN | Yes |
C3L-00090 | Tumor | NaN | Yes |
Join omics to mutations¶
Joining an -omics dataframe with the mutation data for a specified gene or genes involves specific steps. It's worth noting that because there might be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation.
For samples with no mutation for a particular gene, the list will contain either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether the sample is a tumor or normal one. The mutation status column will contain either "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal", which aids with parsing.
Let's consider an example:
somatic_mutations = en.get_somatic_mutation('harmonized')
selected_prot_and_som_mut = en.join_omics_to_mutations(
    omics_name="proteomics",
    mutations_genes="SHANK2",
    omics_genes=["ARF5", "M6PR"],
    omics_source='umich',
    mutations_source='harmonized')
selected_prot_and_som_mut.head(10)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 325)
Name | ARF5_umich_proteomics | M6PR_umich_proteomics | SHANK2_Mutation | SHANK2_Location | SHANK2_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|
Patient_ID | ||||||
C3L-00006 | -0.056513 | 0.016557 | [Missense_Mutation] | [p.S1692R] | Single_mutation | Tumor |
C3L-00008 | 0.549959 | -0.206129 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00032 | 0.088681 | -0.154447 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00084 | -0.846555 | 0.027740 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00090 | 0.539019 | 0.956619 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00098 | -0.017370 | 0.125574 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00136 | 0.230347 | 0.575436 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00137 | 0.191915 | 0.113577 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00139 | -0.410142 | 0.381355 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00143 | -0.170514 | 1.008577 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
In the code above, we're joining proteomics data and somatic mutation data. The gene for the mutation data is "SHANK2" and the genes for the proteomics data are "ARF5" and "M6PR".
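The list-valued mutation columns and the status column described in this section can be handled with ordinary pandas operations. The sketch below uses a small stand-in dataframe shaped like the joined output, with PTEN values copied from outputs elsewhere in this tutorial. It assumes pandas 1.3 or newer, which is needed to explode several columns at once.

```python
import pandas as pd

# Stand-in rows shaped like a joined mutation table (not a live cptac call).
df = pd.DataFrame(
    {
        "PTEN_Mutation": [
            ["Missense_Mutation", "Nonsense_Mutation"],
            ["Missense_Mutation"],
            ["Wildtype_Tumor"],
        ],
        "PTEN_Location": [["p.R130Q", "p.R233*"], ["p.G127R"], ["No_mutation"]],
        "PTEN_Mutation_Status": ["Multiple_mutation", "Single_mutation", "Wildtype_Tumor"],
    },
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00084"], name="Patient_ID"),
)

# The status column lets us pick mutated samples without inspecting the lists.
mutated = df[df["PTEN_Mutation_Status"].isin(["Single_mutation", "Multiple_mutation"])]

# One row per individual mutation: explode the two parallel list columns together.
long_form = mutated.explode(["PTEN_Mutation", "PTEN_Location"])
print(long_form)
```

The long form is convenient for counting mutation types, while the status column is usually enough for grouping samples into mutated and wildtype.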
Filtering multiple mutations¶
If there are multiple mutations, you can use the multi_join function to filter them. The function allows you to specify certain mutation types or locations to prioritize, and it provides a default sorting hierarchy for all other mutations.
Here are some examples:
# Empty list: apply the default sorting hierarchy
SHANK2_default_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                       "harmonized somatic_mutation": "SHANK2"},
                                      mutations_filter=[])

# Prioritize one mutation type
SHANK2_simple_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                      "harmonized somatic_mutation": "SHANK2"},
                                     mutations_filter=["Missense_Mutation"])

# Prioritize a location, then a type; p.R130Q is not a SHANK2 mutation, so cptac warns about it
SHANK2_complex_filter = en.multi_join({"umich proteomics": ["ARF5", "M6PR"],
                                       "harmonized somatic_mutation": "SHANK2"},
                                      mutations_filter=["p.R130Q", "Nonsense_Mutation"])
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 1)
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 5)
cptac warning: Filter value p.R130Q does not exist in the mutations data for the SHANK2 gene, though it exists for other genes. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3972322211.py, line 9)
The mutations_filter parameter allows you to specify the mutations you're interested in. If you don't provide any specific mutations (i.e., you pass an empty list), it will use a default hierarchy, choosing truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence.
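The selection rule described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not cptac's implementation: pick_mutation and position are hypothetical helpers, the grouping of mutation types into "truncating" is an assumption, and the position is read from the first number in the location label.

```python
import re

# Assumed grouping of mutation types; cptac's actual hierarchy may differ.
TRUNCATING = {"Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins"}
MISSENSE = {"Missense_Mutation"}


def position(location):
    """Extract the residue number from a label such as 'p.R130Q'.

    Labels without a number sort last.
    """
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")


def pick_mutation(mutations, locations, mutations_filter=()):
    """Choose one (type, location) pair from two parallel lists."""

    def rank(pair):
        mut_type, loc = pair
        # 1. Values named in the filter win, in the order they were given.
        for i, wanted in enumerate(mutations_filter):
            if wanted in (mut_type, loc):
                return (0, i, position(loc))
        # 2. Otherwise: truncating, then missense, then everything else.
        if mut_type in TRUNCATING:
            tier = 0
        elif mut_type in MISSENSE:
            tier = 1
        else:
            tier = 2
        # Ties are broken by the earlier position in the sequence.
        return (1, tier, position(loc))

    return min(zip(mutations, locations), key=rank)


filtered = pick_mutation(
    ["Missense_Mutation", "Nonsense_Mutation"],
    ["p.R130Q", "p.R233*"],
    mutations_filter=["Missense_Mutation"],
)
default = pick_mutation(
    ["Missense_Mutation", "Nonsense_Mutation"],
    ["p.R130Q", "p.R233*"],
)
print(filtered, default)
```

With the filter, the missense mutation is chosen. Without it, the sketch falls back to the hierarchy and prefers the truncating mutation.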
Join metadata to mutations¶
Joining metadata to mutation data follows the same process as joining other datatypes. You can also use the mutations_filter parameter to filter multiple mutations.
For instance, you can use the get_clinical function to retrieve clinical data, as shown below:
en.get_clinical('mssm')
Name | tumor_code | discovery_study | type_of_analyzed_samples | confirmatory_study | type_of_analyzed_samples | age | sex | race | ethnicity | ethnicity_race_ancestry_identified | ... | additional_treatment_pharmaceutical_therapy_for_new_tumor | additional_treatment_immuno_for_new_tumor | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis | Recurrence-free survival, days | Recurrence-free survival from collection, days | Recurrence status (1, yes; 0, no) | Overall survival, days | Overall survival from collection, days | Survival status (1, dead; 0, alive) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||||||||||
C3L-00006 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 737.0 | 737.0 | 0.0 |
C3L-00008 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 898.0 | 898.0 | 0.0 |
C3L-00032 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 1710.0 | 1710.0 | 0.0 |
C3L-00084 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 335.0 | 335.0 | 0.0 |
C3L-00090 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | Not Hispanic or Latino | White | ... | Yes | No | NaN | NaN | 50.0 | 56.0 | 1 | 1281.0 | 1287.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
C3N-01520 | UCEC | Yes | Tumor | NaN | NaN | 69 | Female | Unknown | Unknown | Slavonic | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 287.0 | 278.0 | 1.0 |
C3N-01521 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | Unknown | Unknown | Slavonic | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 728.0 | 681.0 | 0.0 |
C3N-01537 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | Unknown | Unknown | Slavonic | ... | Yes | No | 62.0 | NaN | 58.0 | 31.0 | 1 | 698.0 | 671.0 | 0.0 |
C3N-01802 | UCEC | Yes | Tumor | NaN | NaN | 85 | Female | Black or African American | Not Hispanic or Latino | American | ... | No | No | NaN | NaN | 598.0 | 563.0 | 1 | 775.0 | 740.0 | 0.0 |
C3N-01825 | UCEC | Yes | Tumor | NaN | NaN | 70 | Female | Unknown | Unknown | Slavonic | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 687.0 | 661.0 | 0.0 |
103 rows × 124 columns
en.join_metadata_to_mutations(
    metadata_name="clinical",
    metadata_source="mssm",
    metadata_cols=["age", "sex", "race"],
    mutations_source="harmonized",
    mutations_genes="SHANK2",
    mutations_filter=["Missense_Mutation"])
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 437)
Name | age | sex | race | SHANK2_Mutation | SHANK2_Location | SHANK2_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|---|
Patient_ID | |||||||
C3L-00006 | 64 | Female | White | Missense_Mutation | p.S1692R | Single_mutation | Tumor |
C3L-00008 | 58 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00032 | 50 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00084 | 74 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00090 | 75 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
... | ... | ... | ... | ... | ... | ... | ... |
C3N-01520 | 69 | Female | Unknown | Missense_Mutation | p.P1586S | Single_mutation | Tumor |
C3N-01521 | 75 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3N-01537 | 74 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3N-01802 | 85 | Female | Black or African American | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3N-01825 | 70 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
103 rows × 7 columns
This command joins the age, sex, and race metadata with the mutation data for the SHANK2 gene. For samples with multiple SHANK2 mutations, the filter prioritizes Missense_Mutation, so each cell holds a single value instead of a list.
If you need to join metadata to a larger number of mutation genes, the multi_join function can be useful. Below, we join the same metadata with the mutation data for the SHANK2, PTEN, and TP53 genes. Here we do not filter mutations. Remember that when mutations_filter is omitted, multi_join behaves the same as join_metadata_to_mutations without a filter: it returns all mutations as lists in the output dataframe, regardless of the number of mutations for a given sample.
en.multi_join({"mssm clinical": ["age", "sex", "race"],
               "harmonized somatic_mutation": ["SHANK2", "PTEN", "TP53"]})
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3189298179.py, line 1)
Name | age | sex | race | SHANK2_Mutation | SHANK2_Location | SHANK2_Mutation_Status | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | TP53_Mutation | TP53_Location | TP53_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||
C3L-00006 | 64 | Female | White | [Missense_Mutation] | [p.S1692R] | Single_mutation | [Missense_Mutation, Nonsense_Mutation] | [p.R130Q, p.R233*] | Multiple_mutation | [Missense_Mutation] | [p.R248W] | Single_mutation | Tumor |
C3L-00008 | 58 | Female | White | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Missense_Mutation] | [p.G127R] | Single_mutation | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00032 | 50 | Female | White | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Nonsense_Mutation] | [p.W111*] | Single_mutation | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00084 | 74 | Female | White | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00090 | 75 | Female | White | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Missense_Mutation] | [p.R130G] | Single_mutation | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
C3N-01520 | 69 | Female | Unknown | [Missense_Mutation] | [p.P1586S] | Single_mutation | [Frame_Shift_Del, Frame_Shift_Ins] | [p.N323fs, p.D268fs] | Multiple_mutation | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3N-01521 | 75 | Female | Unknown | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Missense_Mutation] | [p.H193L] | Single_mutation | Tumor |
C3N-01537 | 74 | Female | Unknown | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3N-01802 | 85 | Female | Black or African American | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Missense_Mutation] | [p.P27S] | Single_mutation | Tumor |
C3N-01825 | 70 | Female | Unknown | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | [Missense_Mutation] | [p.R175H] | Single_mutation | Tumor |
103 rows × 13 columns
Here is an example of joining clinical data with mutations while filtering specific mutations:
survival_and_SHANK2 = en.multi_join({"mssm clinical": ["age", "sex", "race"],
                                     "harmonized somatic_mutation": ["SHANK2", "PTEN", "TP53"]},
                                    mutations_filter=["Missense_Mutation"])
survival_and_SHANK2
cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\Users\sabme\anaconda3\lib\site-packages\cptac\cancers\cancer.py, line 525)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\Users\sabme\AppData\Local\Temp\ipykernel_2264\3101478147.py, line 1)
Name | age | sex | race | SHANK2_Mutation | SHANK2_Location | SHANK2_Mutation_Status | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | TP53_Mutation | TP53_Location | TP53_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||
C3L-00006 | 64 | Female | White | Missense_Mutation | p.S1692R | Single_mutation | Missense_Mutation | p.R130Q | Multiple_mutation | Missense_Mutation | p.R248W | Single_mutation | Tumor |
C3L-00008 | 58 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Missense_Mutation | p.G127R | Single_mutation | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00032 | 50 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Nonsense_Mutation | p.W111* | Single_mutation | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00084 | 74 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3L-00090 | 75 | Female | White | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Missense_Mutation | p.R130G | Single_mutation | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
C3N-01520 | 69 | Female | Unknown | Missense_Mutation | p.P1586S | Single_mutation | Frame_Shift_Ins | p.D268fs | Multiple_mutation | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3N-01521 | 75 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Missense_Mutation | p.H193L | Single_mutation | Tumor |
C3N-01537 | 74 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
C3N-01802 | 85 | Female | Black or African American | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Missense_Mutation | p.P27S | Single_mutation | Tumor |
C3N-01825 | 70 | Female | Unknown | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Missense_Mutation | p.R175H | Single_mutation | Tumor |
103 rows × 13 columns
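Remember that the mutations_filter parameter receives a list. In this example, it prioritizes the "Missense_Mutation" type for every gene specified. Samples with no missense mutation fall back to the default hierarchy, which is why C3L-00032 still shows a Nonsense_Mutation for PTEN.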
Exporting dataframes¶
If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:
survival_and_SHANK2.to_csv(path_or_buf="histologic_type_and_PTEN_mutation.tsv", sep='\t')
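When reading an exported file back, two details are worth keeping in mind: the Patient_ID index is written as the first column, and list-valued cells (as produced by unfiltered mutation joins) are written as plain strings. The sketch below round-trips a small stand-in dataframe through a temporary file. It does not use cptac, and the file name is only an example.

```python
import ast
import os
import tempfile

import pandas as pd

# Stand-in for a joined table with one list-valued column.
df = pd.DataFrame(
    {
        "age": [64, 58],
        "PTEN_Mutation": [["Missense_Mutation", "Nonsense_Mutation"], ["Missense_Mutation"]],
    },
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tsv")
    df.to_csv(path_or_buf=path, sep="\t")
    # index_col=0 restores Patient_ID as the index.
    restored = pd.read_csv(path, sep="\t", index_col=0)

# Lists were saved as their string form, so parse them back explicitly.
restored["PTEN_Mutation"] = restored["PTEN_Mutation"].apply(ast.literal_eval)
print(restored)
```

If the dataframe was joined with a mutations_filter, its cells hold single values and the ast.literal_eval step is unnecessary.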