Tutorial 2: Accessing dataframes and using them with `pandas`¶

A large portion of this package utilizes the functionality of pandas dataframes. Here we give a few specific instances of using dataframes with our package. For a more general overview of dataframes, see pandas documentation and tutorials.

This tutorial will go over getting data from the package (review of Tutorial 1) and various dataframe manipulations, like accessing rows and/or columns.

How do I install the package?¶

Install the package from the Python Package Index (PyPI) in the command line using the pip package installer, with the name of the package:

pip install cptac

How do I import the package once it has been installed?¶

The cptac package has several cancer data sets. First import the entire package by entering the following:

import cptac

The data in the package is broken into specific cancer types. All available types can be seen with the command cptac.list_cancer_options()

To get a data for a cancer, load the dataset object and assign it to a variable, like this:

en = cptac.Ucec()

In [1]:

import cptac
en = cptac.Ucec()

Can I use multiple datasets at the same time?¶

You can have multiple datasets loaded at the same time, just assign each one to its own variable.

In [2]:

ov = cptac.Ov()
co = cptac.Coad()

NOTE: When using multiple data sets, be sure to check that each function used matches the expected data set, as each data set uses the same API. For example, the command for retrieving clinical data is get_clinical() for all data sets, so make sure not to retrieve ovarian clinical data ov.get_clinical() when you meant to get the endometrial clinical data en.get_clinical().

How do I access a particular dataframe?¶

As a reminder of what we learned in Tutorial 1, you can access a specific dataframe by calling the dataset's get_dataframe method and passing in a datatype and source. There are also helper "get" methods for each datatype, for example get_clinical, which works the same as get_dataframe but does not require the datatype parameter. (To see all available dataframes, call the dataset's list_data_sources function, e.g. en.list_data_sources().)

In [3]:

en.list_data_sources()

Out[3]:

	Data type	Available sources
0	CNV	washu
1	acetylproteomics	umich
2	ancestry_prediction	harmonized
3	cibersort	washu
4	circular_RNA	bcm
5	clinical	mssm
6	follow-up	mssm
7	hla_typing	washu
8	medical_history	mssm
9	miRNA	washu
10	phosphoproteomics	umich
11	proteomics	umich
12	somatic_mutation	harmonized, washu
13	transcriptomics	bcm, broad, washu
14	tumor_purity	washu
15	xcell	washu

In [4]:

#Let's get the proteomics data associated with Endometrial cancer; it was generated by the team at UMich
proteomics = en.get_proteomics('umich')
proteomics.head()

Out[4]:

Name	ARF5	M6PR	ESRRA	FKBP4	NDUFAF7	FUCA2	DBNDD1	SEMA3F	CFTR	CYP51A1	...	SCRIB	WIZ	BPIFB4	LDB1	WIZ	TSGA10	RFX7	SWSAP1	MSANTD2	SVIL
Database_ID	ENSP00000000233.5	ENSP00000000412.3	ENSP00000000442.6	ENSP00000001008.4	ENSP00000002125.4	ENSP00000002165.5	ENSP00000002501.6	ENSP00000002829.3	ENSP00000003084.6	ENSP00000003100.8	...	ENSP00000501177.1	ENSP00000501256.3	ENSP00000501266.1	ENSP00000501277.1	ENSP00000501300.1	ENSP00000501312.1	ENSP00000501317.1	ENSP00000501355.1	ENSP00000501466.1	ENSP00000501521.1
Patient_ID
C3L-00006	-0.056513	0.016557	0.002569	0.389819	0.603610	-0.332543	-0.790426	NaN	0.822732	0.039134	...	0.161720	-0.884807	NaN	0.268247	0.125392	-0.880833	0.108554	0.107413	-0.085833	NaN
C3L-00008	0.549959	-0.206129	0.905784	-0.303631	0.018767	0.503513	0.950955	0.080142	NaN	-0.063213	...	NaN	0.054284	NaN	-0.106450	0.380557	-0.756099	0.264611	0.044423	-0.248319	-1.206596
C3L-00032	0.088681	-0.154447	-0.190515	0.170753	0.196356	0.544194	-0.179078	NaN	NaN	0.377405	...	-1.086905	0.055991	NaN	-0.021986	-0.229645	1.923986	NaN	-0.176694	-0.332384	-1.330653
C3L-00084	-0.846555	0.027740	NaN	0.178700	0.264054	-0.183548	0.077215	-0.247164	0.152277	-0.279549	...	-0.125796	0.944212	NaN	0.917409	0.026862	-0.885976	-0.006510	-0.014162	0.365158	NaN
C3L-00090	0.539019	0.956619	-0.039516	0.323656	0.064605	0.173433	-0.524325	-0.038590	-0.311486	0.309905	...	0.853362	-0.716947	NaN	-0.286277	-0.046076	0.089645	-0.444506	-0.072531	-0.463495	NaN

5 rows × 12662 columns

How do I access specific columns in a dataframe?¶

You'll probably want to get a feel for what data is in the dataframes you load. For example, say we want to know what kind of data is included in our proteomics dataframe. Each column in that dataframe contains the proteomics data for a different protein.

You can view a list of column names (which is a list of protein names, in the case of the proteomics dataframe) by appending .columns to the end of a dataframe variable. (If you wish to see all of the column names, even if there are a lot, append .columns.values to the dataframe variable.)

In [5]:

proteomics.columns

Out[5]:

MultiIndex([(   'ARF5', 'ENSP00000000233.5'),
            (   'M6PR', 'ENSP00000000412.3'),
            (  'ESRRA', 'ENSP00000000442.6'),
            (  'FKBP4', 'ENSP00000001008.4'),
            ('NDUFAF7', 'ENSP00000002125.4'),
            (  'FUCA2', 'ENSP00000002165.5'),
            ( 'DBNDD1', 'ENSP00000002501.6'),
            ( 'SEMA3F', 'ENSP00000002829.3'),
            (   'CFTR', 'ENSP00000003084.6'),
            ('CYP51A1', 'ENSP00000003100.8'),
            ...
            (  'SCRIB', 'ENSP00000501177.1'),
            (    'WIZ', 'ENSP00000501256.3'),
            ( 'BPIFB4', 'ENSP00000501266.1'),
            (   'LDB1', 'ENSP00000501277.1'),
            (    'WIZ', 'ENSP00000501300.1'),
            ( 'TSGA10', 'ENSP00000501312.1'),
            (   'RFX7', 'ENSP00000501317.1'),
            ( 'SWSAP1', 'ENSP00000501355.1'),
            ('MSANTD2', 'ENSP00000501466.1'),
            (   'SVIL', 'ENSP00000501521.1')],
           names=['Name', 'Database_ID'], length=12662)

To access specific column (which is a specific protein's data, in the case of the proteomics dataframe), slice the column out of the dataframe using either of the following methods:

proteomics["A1BG"]

or

proteomics.A1BG

Both return the column as a pandas series. The first method is useful when the name of the column you want is stored as a string variable.

In [6]:

protein = "A1BG"
A1BG_col = proteomics[protein]
A1BG_col.head()

Out[6]:

Database_ID	ENSP00000263100.2
Patient_ID
C3L-00006	-1.121101
C3L-00008	-0.798504
C3L-00032	-0.577203
C3L-00084	1.612713
C3L-00090	-1.350755

This dataframe["col_name"] syntax also allows for selection of multiple columns by entering a list of column names.

In [7]:

proteins = ["A1BG","PTEN","TP53"]
selected_prot = proteomics[proteins]
selected_prot.head()

Out[7]:

Name	A1BG	PTEN	TP53
Database_ID	ENSP00000263100.2	ENSP00000361021.3	ENSP00000269305.4	ENSP00000352610.4
Patient_ID
C3L-00006	-1.121101	-0.531482	NaN	-0.218863
C3L-00008	-0.798504	-0.730528	NaN	0.147008
C3L-00032	-0.577203	-1.029402	NaN	-1.149320
C3L-00084	1.612713	-0.587541	1.263191	0.967249
C3L-00090	-1.350755	0.591150	NaN	-0.701003

How do I access specific rows in a dataframe?¶

You can access specific rows in a dataframe (which are specific samples, in the case of the CPTAC data) using the dataframe's .iloc (by row number) or .loc (by row name) method, which both return a pandas Series if you select one row, and a pandas DataFrame if you select multiple rows.

In [8]:

proteomics.iloc[0:5]

Out[8]:

Name	ARF5	M6PR	ESRRA	FKBP4	NDUFAF7	FUCA2	DBNDD1	SEMA3F	CFTR	CYP51A1	...	SCRIB	WIZ	BPIFB4	LDB1	WIZ	TSGA10	RFX7	SWSAP1	MSANTD2	SVIL
Database_ID	ENSP00000000233.5	ENSP00000000412.3	ENSP00000000442.6	ENSP00000001008.4	ENSP00000002125.4	ENSP00000002165.5	ENSP00000002501.6	ENSP00000002829.3	ENSP00000003084.6	ENSP00000003100.8	...	ENSP00000501177.1	ENSP00000501256.3	ENSP00000501266.1	ENSP00000501277.1	ENSP00000501300.1	ENSP00000501312.1	ENSP00000501317.1	ENSP00000501355.1	ENSP00000501466.1	ENSP00000501521.1
Patient_ID
C3L-00006	-0.056513	0.016557	0.002569	0.389819	0.603610	-0.332543	-0.790426	NaN	0.822732	0.039134	...	0.161720	-0.884807	NaN	0.268247	0.125392	-0.880833	0.108554	0.107413	-0.085833	NaN
C3L-00008	0.549959	-0.206129	0.905784	-0.303631	0.018767	0.503513	0.950955	0.080142	NaN	-0.063213	...	NaN	0.054284	NaN	-0.106450	0.380557	-0.756099	0.264611	0.044423	-0.248319	-1.206596
C3L-00032	0.088681	-0.154447	-0.190515	0.170753	0.196356	0.544194	-0.179078	NaN	NaN	0.377405	...	-1.086905	0.055991	NaN	-0.021986	-0.229645	1.923986	NaN	-0.176694	-0.332384	-1.330653
C3L-00084	-0.846555	0.027740	NaN	0.178700	0.264054	-0.183548	0.077215	-0.247164	0.152277	-0.279549	...	-0.125796	0.944212	NaN	0.917409	0.026862	-0.885976	-0.006510	-0.014162	0.365158	NaN
C3L-00090	0.539019	0.956619	-0.039516	0.323656	0.064605	0.173433	-0.524325	-0.038590	-0.311486	0.309905	...	0.853362	-0.716947	NaN	-0.286277	-0.046076	0.089645	-0.444506	-0.072531	-0.463495	NaN

5 rows × 12662 columns

In [9]:

S001_row = proteomics.loc["C3L-00006"]
S001_row.head()

Out[9]:

Name     Database_ID      
ARF5     ENSP00000000233.5   -0.056513
M6PR     ENSP00000000412.3    0.016557
ESRRA    ENSP00000000442.6    0.002569
FKBP4    ENSP00000001008.4    0.389819
NDUFAF7  ENSP00000002125.4    0.603610
Name: C3L-00006, dtype: float64

How do I access specific rows and columns?¶

In addition to selecting specific rows, you can also use .loc to select a subset of rows and columns, using lists.

In [10]:

samples = ["C3L-00006","C3L-00032","C3L-00413"]
proteins = ["A1BG","PTEN","TP53"]
proteomics.loc[samples, proteins]

Out[10]:

Name	A1BG	PTEN	TP53
Database_ID	ENSP00000263100.2	ENSP00000361021.3	ENSP00000269305.4	ENSP00000352610.4
Patient_ID
C3L-00006	-1.121101	-0.531482	NaN	-0.218863
C3L-00032	-0.577203	-1.029402	NaN	-1.149320
C3L-00413	-0.235120	-1.355743	-1.167929	-0.491297

How can I search using conditional statements?¶

There are a variety of ways to use boolean statements to traverse a dataframe. A common way is to pass a boolean statement that selects the data you want to the .loc function. For example, if we want to see all the data for samples that have a positive protein expression level for the A1BG protein, we would pass the .loc function the boolean statement asking for rows containing values above zero for the A1BG column.

.loc has many functionalities. For a full list, see pandas documentation for .loc and indexing and slicing.

In [11]:

a1bg_positive = proteomics.loc[proteomics["A1BG"]['ENSP00000263100.2'] > 0]
a1bg_positive.head()

Out[11]:

Name	ARF5	M6PR	ESRRA	FKBP4	NDUFAF7	FUCA2	DBNDD1	SEMA3F	CFTR	CYP51A1	...	SCRIB	WIZ	BPIFB4	LDB1	WIZ	TSGA10	RFX7	SWSAP1	MSANTD2	SVIL
Database_ID	ENSP00000000233.5	ENSP00000000412.3	ENSP00000000442.6	ENSP00000001008.4	ENSP00000002125.4	ENSP00000002165.5	ENSP00000002501.6	ENSP00000002829.3	ENSP00000003084.6	ENSP00000003100.8	...	ENSP00000501177.1	ENSP00000501256.3	ENSP00000501266.1	ENSP00000501277.1	ENSP00000501300.1	ENSP00000501312.1	ENSP00000501317.1	ENSP00000501355.1	ENSP00000501466.1	ENSP00000501521.1
Patient_ID
C3L-00084	-0.846555	0.027740	NaN	0.178700	0.264054	-0.183548	0.077215	-0.247164	0.152277	-0.279549	...	-0.125796	0.944212	NaN	0.917409	0.026862	-0.885976	-0.006510	-0.014162	0.365158	NaN
C3L-00771	-0.367256	-0.238946	NaN	-0.401743	-0.050914	-0.705455	0.113656	-0.128739	-0.740824	-0.389428	...	-0.270607	0.293873	NaN	0.070451	-0.073260	-1.095170	0.501484	0.367101	0.557083	NaN
C3L-00938	-0.065230	-0.331508	0.508535	-0.195992	-0.200816	-0.095191	0.207192	0.404246	0.485247	-0.651001	...	-0.893192	0.637911	NaN	0.200598	-0.129226	-0.073511	-0.337527	NaN	0.156450	0.506879
C3L-01252	0.227334	-0.015206	0.380634	0.169788	-0.007953	0.272649	-0.202259	0.035179	NaN	-0.509328	...	-0.553422	-0.179688	NaN	0.181879	0.076132	0.125096	NaN	0.721806	0.349941	-0.548896
C3L-01253	-0.430675	-0.737361	0.555245	0.024818	-0.094097	-0.327127	0.411960	NaN	NaN	-1.126608	...	-0.701504	0.296178	NaN	0.272620	0.280234	-0.947285	0.300586	0.138227	0.476971	1.063495

5 rows × 12662 columns

For another example, suppose we wanted to separate clinical information for white or female patients, as recorded in the race/sex columns respectively.

In [12]:

clinical = en.get_clinical("mssm")
white_clinical = clinical.loc[clinical["race"] == "White"]
female_clinical = clinical.loc[clinical["sex"] == "Female"]

In [13]:

white_clinical.head()

Out[13]:

Name	tumor_code	discovery_study	type_of_analyzed_samples	confirmatory_study	type_of_analyzed_samples	age	sex	race	ethnicity	ethnicity_race_ancestry_identified	...	additional_treatment_pharmaceutical_therapy_for_new_tumor	additional_treatment_immuno_for_new_tumor	number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional	number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis	Recurrence-free survival, days	Recurrence-free survival from collection, days	Recurrence status (1, yes; 0, no)	Overall survival, days	Overall survival from collection, days	Survival status (1, dead; 0, alive)
Patient_ID
C3L-00006	UCEC	Yes	Tumor_and_Normal	NaN	NaN	64	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	737.0	737.0	0.0
C3L-00008	UCEC	Yes	Tumor	NaN	NaN	58	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	898.0	898.0	0.0
C3L-00032	UCEC	Yes	Tumor	NaN	NaN	50	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	1710.0	1710.0	0.0
C3L-00084	UCEC	Yes	Tumor	NaN	NaN	74	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	335.0	335.0	0.0
C3L-00090	UCEC	Yes	Tumor	NaN	NaN	75	Female	White	Not Hispanic or Latino	White	...	Yes	No	NaN	NaN	50.0	56.0	1	1281.0	1287.0	1.0

5 rows × 124 columns

In [14]:

female_clinical.head()

Out[14]:

Name	tumor_code	discovery_study	type_of_analyzed_samples	confirmatory_study	type_of_analyzed_samples	age	sex	race	ethnicity	ethnicity_race_ancestry_identified	...	additional_treatment_pharmaceutical_therapy_for_new_tumor	additional_treatment_immuno_for_new_tumor	number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional	number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis	Recurrence-free survival, days	Recurrence-free survival from collection, days	Recurrence status (1, yes; 0, no)	Overall survival, days	Overall survival from collection, days	Survival status (1, dead; 0, alive)
Patient_ID
C3L-00006	UCEC	Yes	Tumor_and_Normal	NaN	NaN	64	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	737.0	737.0	0.0
C3L-00008	UCEC	Yes	Tumor	NaN	NaN	58	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	898.0	898.0	0.0
C3L-00032	UCEC	Yes	Tumor	NaN	NaN	50	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	1710.0	1710.0	0.0
C3L-00084	UCEC	Yes	Tumor	NaN	NaN	74	Female	White	Not Hispanic or Latino	White	...	NaN	NaN	NaN	NaN	NaN	NaN	0	335.0	335.0	0.0
C3L-00090	UCEC	Yes	Tumor	NaN	NaN	75	Female	White	Not Hispanic or Latino	White	...	Yes	No	NaN	NaN	50.0	56.0	1	1281.0	1287.0	1.0

5 rows × 124 columns

How do I export a dataframe to a file?¶

If you wish to export a dataframe to a file, call the dataframe's built-in to_csv method, specifying the path you wish to save to, and the separator you wish to use:

In [15]:

clinical = en.get_clinical('mssm')
clinical.to_csv(path_or_buf="clinical_df.tsv", sep='\t')

Tutorial 2: Accessing dataframes and using them with pandas¶

How do I install the package?¶

How do I import the package once it has been installed?¶

Can I use multiple datasets at the same time?¶

How do I access a particular dataframe?¶

How do I access specific columns in a dataframe?¶

How do I access specific rows in a dataframe?¶

How do I access specific rows and columns?¶

How can I search using conditional statements?¶

How do I export a dataframe to a file?¶

Tutorial 2: Accessing dataframes and using them with `pandas`¶