Tutorial 2: Accessing dataframes and using them with pandas
¶
A large portion of this package utilizes the functionality of pandas dataframes. Here we give a few specific instances of using dataframes with our package. For a more general overview of dataframes, see pandas documentation and tutorials.
This tutorial will go over getting data from the package (review of Tutorial 1) and various dataframe manipulations, like accessing rows and/or columns.
How do I install the package?¶
Install the package from the Python Package Index (PyPI) in the command line using the pip package installer, with the name of the package:
pip install cptac
How do I import the package once it has been installed?¶
The cptac package has several cancer data sets. First import the entire package by entering the following:
import cptac
The data in the package is broken into specific cancer types. All available types can be seen with the command cptac.list_cancer_options()
To get a data for a cancer, load the dataset object and assign it to a variable, like this:
en = cptac.Ucec()
import cptac
en = cptac.Ucec()
Can I use multiple datasets at the same time?¶
You can have multiple datasets loaded at the same time, just assign each one to its own variable.
ov = cptac.Ov()
co = cptac.Coad()
NOTE: When using multiple data sets, be sure to check that each function used matches the expected data set, as each data set uses the same API. For example, the command for retrieving clinical data is get_clinical()
for all data sets, so make sure not to retrieve ovarian clinical data ov.get_clinical()
when you meant to get the endometrial clinical data en.get_clinical()
.
How do I access a particular dataframe?¶
As a reminder of what we learned in Tutorial 1, you can access a specific dataframe by calling the dataset's get_dataframe
method and passing in a datatype and source. There are also helper "get" methods for each datatype, for example get_clinical
, which works the same as get_dataframe
but does not require the datatype parameter. (To see all available dataframes, call the dataset's list_data_sources
function, e.g. en.list_data_sources()
.)
en.list_data_sources()
Data type | Available sources | |
---|---|---|
0 | CNV | washu |
1 | acetylproteomics | umich |
2 | ancestry_prediction | harmonized |
3 | cibersort | washu |
4 | circular_RNA | bcm |
5 | clinical | mssm |
6 | follow-up | mssm |
7 | hla_typing | washu |
8 | medical_history | mssm |
9 | miRNA | washu |
10 | phosphoproteomics | umich |
11 | proteomics | umich |
12 | somatic_mutation | harmonized, washu |
13 | transcriptomics | bcm, broad, washu |
14 | tumor_purity | washu |
15 | xcell | washu |
#Let's get the proteomics data associated with Endometrial cancer; it was generated by the team at UMich
proteomics = en.get_proteomics('umich')
proteomics.head()
Name | ARF5 | M6PR | ESRRA | FKBP4 | NDUFAF7 | FUCA2 | DBNDD1 | SEMA3F | CFTR | CYP51A1 | ... | SCRIB | WIZ | BPIFB4 | LDB1 | WIZ | TSGA10 | RFX7 | SWSAP1 | MSANTD2 | SVIL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSP00000000412.3 | ENSP00000000442.6 | ENSP00000001008.4 | ENSP00000002125.4 | ENSP00000002165.5 | ENSP00000002501.6 | ENSP00000002829.3 | ENSP00000003084.6 | ENSP00000003100.8 | ... | ENSP00000501177.1 | ENSP00000501256.3 | ENSP00000501266.1 | ENSP00000501277.1 | ENSP00000501300.1 | ENSP00000501312.1 | ENSP00000501317.1 | ENSP00000501355.1 | ENSP00000501466.1 | ENSP00000501521.1 |
Patient_ID | |||||||||||||||||||||
C3L-00006 | -0.056513 | 0.016557 | 0.002569 | 0.389819 | 0.603610 | -0.332543 | -0.790426 | NaN | 0.822732 | 0.039134 | ... | 0.161720 | -0.884807 | NaN | 0.268247 | 0.125392 | -0.880833 | 0.108554 | 0.107413 | -0.085833 | NaN |
C3L-00008 | 0.549959 | -0.206129 | 0.905784 | -0.303631 | 0.018767 | 0.503513 | 0.950955 | 0.080142 | NaN | -0.063213 | ... | NaN | 0.054284 | NaN | -0.106450 | 0.380557 | -0.756099 | 0.264611 | 0.044423 | -0.248319 | -1.206596 |
C3L-00032 | 0.088681 | -0.154447 | -0.190515 | 0.170753 | 0.196356 | 0.544194 | -0.179078 | NaN | NaN | 0.377405 | ... | -1.086905 | 0.055991 | NaN | -0.021986 | -0.229645 | 1.923986 | NaN | -0.176694 | -0.332384 | -1.330653 |
C3L-00084 | -0.846555 | 0.027740 | NaN | 0.178700 | 0.264054 | -0.183548 | 0.077215 | -0.247164 | 0.152277 | -0.279549 | ... | -0.125796 | 0.944212 | NaN | 0.917409 | 0.026862 | -0.885976 | -0.006510 | -0.014162 | 0.365158 | NaN |
C3L-00090 | 0.539019 | 0.956619 | -0.039516 | 0.323656 | 0.064605 | 0.173433 | -0.524325 | -0.038590 | -0.311486 | 0.309905 | ... | 0.853362 | -0.716947 | NaN | -0.286277 | -0.046076 | 0.089645 | -0.444506 | -0.072531 | -0.463495 | NaN |
5 rows × 12662 columns
How do I access specific columns in a dataframe?¶
You'll probably want to get a feel for what data is in the dataframes you load. For example, say we want to know what kind of data is included in our proteomics dataframe. Each column in that dataframe contains the proteomics data for a different protein.
You can view a list of column names (which is a list of protein names, in the case of the proteomics dataframe) by appending .columns
to the end of a dataframe variable. (If you wish to see all of the column names, even if there are a lot, append .columns.values
to the dataframe variable.)
proteomics.columns
MultiIndex([( 'ARF5', 'ENSP00000000233.5'), ( 'M6PR', 'ENSP00000000412.3'), ( 'ESRRA', 'ENSP00000000442.6'), ( 'FKBP4', 'ENSP00000001008.4'), ('NDUFAF7', 'ENSP00000002125.4'), ( 'FUCA2', 'ENSP00000002165.5'), ( 'DBNDD1', 'ENSP00000002501.6'), ( 'SEMA3F', 'ENSP00000002829.3'), ( 'CFTR', 'ENSP00000003084.6'), ('CYP51A1', 'ENSP00000003100.8'), ... ( 'SCRIB', 'ENSP00000501177.1'), ( 'WIZ', 'ENSP00000501256.3'), ( 'BPIFB4', 'ENSP00000501266.1'), ( 'LDB1', 'ENSP00000501277.1'), ( 'WIZ', 'ENSP00000501300.1'), ( 'TSGA10', 'ENSP00000501312.1'), ( 'RFX7', 'ENSP00000501317.1'), ( 'SWSAP1', 'ENSP00000501355.1'), ('MSANTD2', 'ENSP00000501466.1'), ( 'SVIL', 'ENSP00000501521.1')], names=['Name', 'Database_ID'], length=12662)
To access specific column (which is a specific protein's data, in the case of the proteomics dataframe), slice the column out of the dataframe using either of the following methods:
proteomics["A1BG"]
or
proteomics.A1BG
Both return the column as a pandas series. The first method is useful when the name of the column you want is stored as a string variable.
protein = "A1BG"
A1BG_col = proteomics[protein]
A1BG_col.head()
Database_ID | ENSP00000263100.2 |
---|---|
Patient_ID | |
C3L-00006 | -1.121101 |
C3L-00008 | -0.798504 |
C3L-00032 | -0.577203 |
C3L-00084 | 1.612713 |
C3L-00090 | -1.350755 |
This dataframe["col_name"]
syntax also allows for selection of multiple columns by entering a list of column names.
proteins = ["A1BG","PTEN","TP53"]
selected_prot = proteomics[proteins]
selected_prot.head()
Name | A1BG | PTEN | TP53 | |
---|---|---|---|---|
Database_ID | ENSP00000263100.2 | ENSP00000361021.3 | ENSP00000269305.4 | ENSP00000352610.4 |
Patient_ID | ||||
C3L-00006 | -1.121101 | -0.531482 | NaN | -0.218863 |
C3L-00008 | -0.798504 | -0.730528 | NaN | 0.147008 |
C3L-00032 | -0.577203 | -1.029402 | NaN | -1.149320 |
C3L-00084 | 1.612713 | -0.587541 | 1.263191 | 0.967249 |
C3L-00090 | -1.350755 | 0.591150 | NaN | -0.701003 |
How do I access specific rows in a dataframe?¶
You can access specific rows in a dataframe (which are specific samples, in the case of the CPTAC data) using the dataframe's .iloc
(by row number) or .loc
(by row name) method, which both return a pandas Series if you select one row, and a pandas DataFrame if you select multiple rows.
proteomics.iloc[0:5]
Name | ARF5 | M6PR | ESRRA | FKBP4 | NDUFAF7 | FUCA2 | DBNDD1 | SEMA3F | CFTR | CYP51A1 | ... | SCRIB | WIZ | BPIFB4 | LDB1 | WIZ | TSGA10 | RFX7 | SWSAP1 | MSANTD2 | SVIL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSP00000000412.3 | ENSP00000000442.6 | ENSP00000001008.4 | ENSP00000002125.4 | ENSP00000002165.5 | ENSP00000002501.6 | ENSP00000002829.3 | ENSP00000003084.6 | ENSP00000003100.8 | ... | ENSP00000501177.1 | ENSP00000501256.3 | ENSP00000501266.1 | ENSP00000501277.1 | ENSP00000501300.1 | ENSP00000501312.1 | ENSP00000501317.1 | ENSP00000501355.1 | ENSP00000501466.1 | ENSP00000501521.1 |
Patient_ID | |||||||||||||||||||||
C3L-00006 | -0.056513 | 0.016557 | 0.002569 | 0.389819 | 0.603610 | -0.332543 | -0.790426 | NaN | 0.822732 | 0.039134 | ... | 0.161720 | -0.884807 | NaN | 0.268247 | 0.125392 | -0.880833 | 0.108554 | 0.107413 | -0.085833 | NaN |
C3L-00008 | 0.549959 | -0.206129 | 0.905784 | -0.303631 | 0.018767 | 0.503513 | 0.950955 | 0.080142 | NaN | -0.063213 | ... | NaN | 0.054284 | NaN | -0.106450 | 0.380557 | -0.756099 | 0.264611 | 0.044423 | -0.248319 | -1.206596 |
C3L-00032 | 0.088681 | -0.154447 | -0.190515 | 0.170753 | 0.196356 | 0.544194 | -0.179078 | NaN | NaN | 0.377405 | ... | -1.086905 | 0.055991 | NaN | -0.021986 | -0.229645 | 1.923986 | NaN | -0.176694 | -0.332384 | -1.330653 |
C3L-00084 | -0.846555 | 0.027740 | NaN | 0.178700 | 0.264054 | -0.183548 | 0.077215 | -0.247164 | 0.152277 | -0.279549 | ... | -0.125796 | 0.944212 | NaN | 0.917409 | 0.026862 | -0.885976 | -0.006510 | -0.014162 | 0.365158 | NaN |
C3L-00090 | 0.539019 | 0.956619 | -0.039516 | 0.323656 | 0.064605 | 0.173433 | -0.524325 | -0.038590 | -0.311486 | 0.309905 | ... | 0.853362 | -0.716947 | NaN | -0.286277 | -0.046076 | 0.089645 | -0.444506 | -0.072531 | -0.463495 | NaN |
5 rows × 12662 columns
S001_row = proteomics.loc["C3L-00006"]
S001_row.head()
Name Database_ID ARF5 ENSP00000000233.5 -0.056513 M6PR ENSP00000000412.3 0.016557 ESRRA ENSP00000000442.6 0.002569 FKBP4 ENSP00000001008.4 0.389819 NDUFAF7 ENSP00000002125.4 0.603610 Name: C3L-00006, dtype: float64
How do I access specific rows and columns?¶
In addition to selecting specific rows, you can also use .loc
to select a subset of rows and columns, using lists.
samples = ["C3L-00006","C3L-00032","C3L-00413"]
proteins = ["A1BG","PTEN","TP53"]
proteomics.loc[samples, proteins]
Name | A1BG | PTEN | TP53 | |
---|---|---|---|---|
Database_ID | ENSP00000263100.2 | ENSP00000361021.3 | ENSP00000269305.4 | ENSP00000352610.4 |
Patient_ID | ||||
C3L-00006 | -1.121101 | -0.531482 | NaN | -0.218863 |
C3L-00032 | -0.577203 | -1.029402 | NaN | -1.149320 |
C3L-00413 | -0.235120 | -1.355743 | -1.167929 | -0.491297 |
How can I search using conditional statements?¶
There are a variety of ways to use boolean statements to traverse a dataframe. A common way is to pass a boolean statement that selects the data you want to the .loc
function. For example, if we want to see all the data for samples that have a positive protein expression level for the A1BG protein, we would pass the .loc
function the boolean statement asking for rows containing values above zero for the A1BG column.
.loc
has many functionalities. For a full list, see pandas documentation for .loc and indexing and slicing.
a1bg_positive = proteomics.loc[proteomics["A1BG"]['ENSP00000263100.2'] > 0]
a1bg_positive.head()
Name | ARF5 | M6PR | ESRRA | FKBP4 | NDUFAF7 | FUCA2 | DBNDD1 | SEMA3F | CFTR | CYP51A1 | ... | SCRIB | WIZ | BPIFB4 | LDB1 | WIZ | TSGA10 | RFX7 | SWSAP1 | MSANTD2 | SVIL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | ENSP00000000233.5 | ENSP00000000412.3 | ENSP00000000442.6 | ENSP00000001008.4 | ENSP00000002125.4 | ENSP00000002165.5 | ENSP00000002501.6 | ENSP00000002829.3 | ENSP00000003084.6 | ENSP00000003100.8 | ... | ENSP00000501177.1 | ENSP00000501256.3 | ENSP00000501266.1 | ENSP00000501277.1 | ENSP00000501300.1 | ENSP00000501312.1 | ENSP00000501317.1 | ENSP00000501355.1 | ENSP00000501466.1 | ENSP00000501521.1 |
Patient_ID | |||||||||||||||||||||
C3L-00084 | -0.846555 | 0.027740 | NaN | 0.178700 | 0.264054 | -0.183548 | 0.077215 | -0.247164 | 0.152277 | -0.279549 | ... | -0.125796 | 0.944212 | NaN | 0.917409 | 0.026862 | -0.885976 | -0.006510 | -0.014162 | 0.365158 | NaN |
C3L-00771 | -0.367256 | -0.238946 | NaN | -0.401743 | -0.050914 | -0.705455 | 0.113656 | -0.128739 | -0.740824 | -0.389428 | ... | -0.270607 | 0.293873 | NaN | 0.070451 | -0.073260 | -1.095170 | 0.501484 | 0.367101 | 0.557083 | NaN |
C3L-00938 | -0.065230 | -0.331508 | 0.508535 | -0.195992 | -0.200816 | -0.095191 | 0.207192 | 0.404246 | 0.485247 | -0.651001 | ... | -0.893192 | 0.637911 | NaN | 0.200598 | -0.129226 | -0.073511 | -0.337527 | NaN | 0.156450 | 0.506879 |
C3L-01252 | 0.227334 | -0.015206 | 0.380634 | 0.169788 | -0.007953 | 0.272649 | -0.202259 | 0.035179 | NaN | -0.509328 | ... | -0.553422 | -0.179688 | NaN | 0.181879 | 0.076132 | 0.125096 | NaN | 0.721806 | 0.349941 | -0.548896 |
C3L-01253 | -0.430675 | -0.737361 | 0.555245 | 0.024818 | -0.094097 | -0.327127 | 0.411960 | NaN | NaN | -1.126608 | ... | -0.701504 | 0.296178 | NaN | 0.272620 | 0.280234 | -0.947285 | 0.300586 | 0.138227 | 0.476971 | 1.063495 |
5 rows × 12662 columns
For another example, suppose we wanted to separate clinical information for white or female patients, as recorded in the race/sex columns respectively.
clinical = en.get_clinical("mssm")
white_clinical = clinical.loc[clinical["race"] == "White"]
female_clinical = clinical.loc[clinical["sex"] == "Female"]
white_clinical.head()
Name | tumor_code | discovery_study | type_of_analyzed_samples | confirmatory_study | type_of_analyzed_samples | age | sex | race | ethnicity | ethnicity_race_ancestry_identified | ... | additional_treatment_pharmaceutical_therapy_for_new_tumor | additional_treatment_immuno_for_new_tumor | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis | Recurrence-free survival, days | Recurrence-free survival from collection, days | Recurrence status (1, yes; 0, no) | Overall survival, days | Overall survival from collection, days | Survival status (1, dead; 0, alive) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||||||||||
C3L-00006 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 737.0 | 737.0 | 0.0 |
C3L-00008 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 898.0 | 898.0 | 0.0 |
C3L-00032 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 1710.0 | 1710.0 | 0.0 |
C3L-00084 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 335.0 | 335.0 | 0.0 |
C3L-00090 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | Not Hispanic or Latino | White | ... | Yes | No | NaN | NaN | 50.0 | 56.0 | 1 | 1281.0 | 1287.0 | 1.0 |
5 rows × 124 columns
female_clinical.head()
Name | tumor_code | discovery_study | type_of_analyzed_samples | confirmatory_study | type_of_analyzed_samples | age | sex | race | ethnicity | ethnicity_race_ancestry_identified | ... | additional_treatment_pharmaceutical_therapy_for_new_tumor | additional_treatment_immuno_for_new_tumor | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional | number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis | Recurrence-free survival, days | Recurrence-free survival from collection, days | Recurrence status (1, yes; 0, no) | Overall survival, days | Overall survival from collection, days | Survival status (1, dead; 0, alive) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||||||||||
C3L-00006 | UCEC | Yes | Tumor_and_Normal | NaN | NaN | 64 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 737.0 | 737.0 | 0.0 |
C3L-00008 | UCEC | Yes | Tumor | NaN | NaN | 58 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 898.0 | 898.0 | 0.0 |
C3L-00032 | UCEC | Yes | Tumor | NaN | NaN | 50 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 1710.0 | 1710.0 | 0.0 |
C3L-00084 | UCEC | Yes | Tumor | NaN | NaN | 74 | Female | White | Not Hispanic or Latino | White | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 335.0 | 335.0 | 0.0 |
C3L-00090 | UCEC | Yes | Tumor | NaN | NaN | 75 | Female | White | Not Hispanic or Latino | White | ... | Yes | No | NaN | NaN | 50.0 | 56.0 | 1 | 1281.0 | 1287.0 | 1.0 |
5 rows × 124 columns
How do I export a dataframe to a file?¶
If you wish to export a dataframe to a file, call the dataframe's built-in to_csv
method, specifying the path you wish to save to, and the separator you wish to use:
clinical = en.get_clinical('mssm')
clinical.to_csv(path_or_buf="clinical_df.tsv", sep='\t')