Tutorial 4: Multi-level column indices (MultiIndex)¶

A MultiIndex is an index with multiple, hierarchical levels. We use a multiindex for the column headers of a dataframe. This allows each column to have multiple keys associated with it--one key for each level.

The multiindex is a native part of the pandas package. For more documentation, see their MultiIndex / advanced indexing page.

MultiIndexes make it easy to associate multiple identifiers with a single column. In this tutorial, we'll work with phosphoproteomics data from the endometrial cancer dataset.

Phosphoproteomics is the large-scale study of proteins that have undergone a process called phosphorylation. This is particularly important in the context of disease, as the process is often disrupted in cancer cells.

Let's start by importing our required package and loading the endometrial dataset.

In [1]:
import cptac
en = cptac.Ucec()

Let's see the data sources available in our dataset.

In [2]:
en.list_data_sources()
Out[2]:
Data type Available sources
0 CNV [bcm, washu]
1 circular_RNA [bcm]
2 miRNA [bcm, washu]
3 proteomics [bcm, umich]
4 transcriptomics [bcm, broad, washu]
5 ancestry_prediction [harmonized]
6 somatic_mutation [harmonized, washu]
7 clinical [mssm]
8 follow-up [mssm]
9 medical_history [mssm]
10 acetylproteomics [umich]
11 phosphoproteomics [umich]
12 cibersort [washu]
13 hla_typing [washu]
14 tumor_purity [washu]
15 xcell [washu]

Here, we see different types of omics data (e.g., CNV, proteomics, phosphoproteomics) available from various sources (e.g., washu, umich).

We will retrieve phosphoproteomics, proteomics and CNV data from the respective sources.

In [3]:
#Some data may take several minutes to load
en.get_proteomics('umich')
en.get_phosphoproteomics('umich')
en.get_CNV('washu')
cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\Users\sabme\anaconda3\lib\threading.py, line 910)
Out[3]:
Name A1BG A1CF A2M A2ML1 A3GALT2 A4GALT A4GNT AAAS AACS AADAC ... ZW10 ZWILCH ZWINT ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3 pk
Database_ID ENSG00000121410.10 ENSG00000148584.13 ENSG00000175899.13 ENSG00000166535.18 ENSG00000184389.9 ENSG00000128274.14 ENSG00000118017.3 ENSG00000094914.11 ENSG00000081760.15 ENSG00000114771.12 ... ENSG00000086827.7 ENSG00000174442.10 ENSG00000122952.15 ENSG00000070476.13 ENSG00000203995.8 ENSG00000162378.11 ENSG00000159840.14 ENSG00000074755.13 ENSG00000036549.11 ENSG00000091436.15
Patient_ID
C3L-00006 -0.00659 -0.01982 -0.01402 -0.01402 -0.01418 -0.00839 -0.01305 -0.01402 -0.01402 -0.01305 ... -0.01641 -0.00963 -0.01982 -0.01305 -0.01418 -0.01418 -0.01897 -0.00529 -0.01418 -0.01480
C3L-00008 0.02578 0.00726 0.01350 0.01350 0.00732 0.01642 0.01005 0.01225 0.01225 0.01005 ... 0.01583 0.01844 0.00726 0.01005 0.00732 0.00732 0.01200 0.01969 0.00732 0.01121
C3L-00032 0.01262 0.00425 -0.00275 -0.00275 0.00166 0.00549 -0.00038 -0.00275 -0.00275 -0.00038 ... -0.00305 0.00214 0.00425 -0.00038 0.00166 0.00166 0.01408 0.00683 0.00166 0.00208
C3L-00090 0.00100 0.41191 -0.02299 -0.02299 -0.02436 -0.01198 -0.03307 -0.02299 -0.02299 -0.03307 ... -0.01982 -0.02071 0.41191 -0.61621 -0.02436 -0.02436 -0.02182 -0.00336 -0.02436 -0.02548
C3L-00098 1.01075 0.27221 -0.39802 -0.39802 0.00226 0.31684 0.31108 -0.38507 -0.41089 0.31108 ... -0.39843 -0.38591 0.27221 0.31108 0.01711 0.01711 -0.01434 -0.34344 -0.01427 0.53267
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
C3N-01520 -0.05661 -0.06508 -0.06174 -0.06174 -0.05318 -0.05054 -0.06413 -0.06174 -0.06174 -0.06413 ... -0.05110 -0.05228 -0.06508 -0.06413 -0.05318 -0.05318 0.37759 -0.04914 -0.05318 -0.05533
C3N-01521 -0.36477 -0.00244 -0.06953 -0.06953 0.38241 -0.34150 0.31502 -0.06953 -0.04038 0.31502 ... -0.05879 -0.19572 -0.00244 0.79082 0.38241 0.38241 0.58054 -0.34785 0.38241 0.00842
C3N-01537 0.09203 0.00535 0.08807 0.08807 -0.19341 -0.29275 0.08943 0.08807 0.07883 0.08943 ... -0.11330 0.07121 0.00535 0.08943 -0.10289 -0.10289 -0.01151 -0.30164 -0.10289 0.02389
C3N-01802 -0.06298 -0.04134 0.05070 0.05070 -0.00420 -0.18427 0.17981 0.08711 -0.12670 0.17981 ... 0.14454 0.05509 -0.04134 0.17981 -0.12128 -0.12128 -0.06015 0.14747 -0.13738 -0.01938
C3N-01825 0.12974 0.03784 0.11400 0.11400 0.04662 0.01662 0.13939 0.11400 -0.00039 0.13939 ... 0.02229 -0.00842 0.03784 0.13939 0.04662 0.04662 0.03222 -0.02923 0.04662 0.02880

95 rows × 18919 columns

Now that our data is loaded, let's take a look at the structure of the phosphoproteomics data.

In [4]:
# Display first 5 rows of phosphoproteomics data
en.get_phosphoproteomics('umich').head()
Out[4]:
Name ARF5 M6PR ESRRA ... SCRIB TSGA10 SVIL
Site S137 S267 S19 S22 S19S22 T31 S19S22 S19S22S27 S19S22T31 S27 ... S1575T1588S1594 S1594 S11 S173 S213 S391 S779 S101 S296 S459
Peptide QDMPNAMPVsELTDK GVGDDQLGEEsEERDDHLLPM AEPAsPDSPK AEPASPDsPK AEPAsPDsPK AEPASPDSPKGSSETEtEPPVALAPGPAPTR AEPAsPDsPKGSSETETEPPVALAPGPAPTR AEPAsPDsPKGSsETETEPPVALAPGPAPTR AEPAsPDsPKGSSETEtEPPVALAPGPAPTR GSsETETEPPVALAPGPAPTR ... LAEAPSPAPTPsPTPVEDLGPQTStSPGRLsPDFAEELR LsPDFAEELR sPGRDPELQVEAAEVTTK sPSRLDSFVK RPsPTAR AMDTEsELGR GLDRsLEENLCYR;GLDRsLEENLCYRDF EVVSSQVDDLTsHNEHLCK DSEGDTPsLINWPSSK LPsPTVAR
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ENSP00000000442.6 ... ENSP00000501177.1 ENSP00000501177.1 ENSP00000501312.1 ENSP00000501312.1 ENSP00000501312.1 ENSP00000501312.1 ENSP00000501312.1 ENSP00000501312.1 ENSP00000501521.1 ENSP00000501521.1
Patient_ID
C3L-00006 NaN 0.573633 NaN NaN 0.304721 NaN NaN NaN NaN NaN ... NaN 0.667426 NaN 0.905606 NaN -0.069911 -0.584774 NaN -0.561657 -0.652457
C3L-00008 0.003632 -0.393734 NaN NaN 0.789193 NaN NaN NaN NaN NaN ... NaN NaN NaN -0.488427 NaN NaN NaN NaN -0.431599 -1.079638
C3L-00032 NaN -0.211020 NaN NaN NaN NaN NaN NaN NaN 0.131605 ... NaN 0.104862 NaN NaN NaN NaN NaN NaN NaN -1.439041
C3L-00084 NaN 0.220473 NaN NaN -0.290506 NaN NaN NaN NaN NaN ... NaN 0.399718 -0.875016 -0.579824 NaN NaN -0.505807 NaN NaN -1.521725
C3L-00090 NaN 0.161496 NaN NaN 0.708453 NaN 0.405402 1.253045 NaN 0.265813 ... NaN 1.069439 NaN 0.510268 -1.889144 NaN -0.592203 NaN NaN -1.126482

5 rows × 85661 columns

Join functions with multiindices¶

The join functions have been written to handle multiindices. More information on the join functions can be found in the joining_dataframes tutorial. An example of joining a multiindexed dataframe (in this case phosphoproteomics) with a non multiindexed dataframe (in this case CNV) is below.

In [5]:
phospho_and_CNV = en.join_omics_to_omics(df1_name="CNV", df2_name="phosphoproteomics", df1_source='washu', df2_source = 'umich')
phospho_and_CNV.head()
Out[5]:
Name A1BG_washu_CNV A1CF_washu_CNV A2M_washu_CNV A2ML1_washu_CNV A3GALT2_washu_CNV A4GALT_washu_CNV A4GNT_washu_CNV AAAS_washu_CNV AACS_washu_CNV AADAC_washu_CNV ... SCRIB_umich_phosphoproteomics TSGA10_umich_phosphoproteomics SVIL_umich_phosphoproteomics
Site ... S1575T1588S1594 S1594 S11 S173 S213 S391 S779 S101 S296 S459
Peptide ... LAEAPSPAPTPsPTPVEDLGPQTStSPGRLsPDFAEELR LsPDFAEELR sPGRDPELQVEAAEVTTK sPSRLDSFVK RPsPTAR AMDTEsELGR GLDRsLEENLCYR;GLDRsLEENLCYRDF EVVSSQVDDLTsHNEHLCK DSEGDTPsLINWPSSK LPsPTVAR
Patient_ID
C3L-00006 -0.00659 -0.01982 -0.01402 -0.01402 -0.01418 -0.00839 -0.01305 -0.01402 -0.01402 -0.01305 ... NaN 0.667426 NaN 0.905606 NaN -0.069911 -0.584774 NaN -0.561657 -0.652457
C3L-00008 0.02578 0.00726 0.01350 0.01350 0.00732 0.01642 0.01005 0.01225 0.01225 0.01005 ... NaN NaN NaN -0.488427 NaN NaN NaN NaN -0.431599 -1.079638
C3L-00032 0.01262 0.00425 -0.00275 -0.00275 0.00166 0.00549 -0.00038 -0.00275 -0.00275 -0.00038 ... NaN 0.104862 NaN NaN NaN NaN NaN NaN NaN -1.439041
C3L-00084 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.399718 -0.875016 -0.579824 NaN NaN -0.505807 NaN NaN -1.521725
C3L-00090 0.00100 0.41191 -0.02299 -0.02299 -0.02436 -0.01198 -0.03307 -0.02299 -0.02299 -0.03307 ... NaN 1.069439 NaN 0.510268 -1.889144 NaN -0.592203 NaN NaN -1.126482

5 rows × 104580 columns

Since the C3L-00084 row doesn't have CNV data, it is filled in with NANs, so that it can be joined to the CNV dataframe.

How to select from multiindex¶

Selecting based on all levels¶

We can select single columns by passing the proper keys for all levels of the multiindex. For example, to get the proteomics for ARF5, we'd do the following:

In [6]:
prot = en.get_proteomics('umich')
all_levels_selection = prot["ARF5"]

#Display the first 10 rows of the desired data
all_levels_selection.head(10)
Out[6]:
Database_ID ENSP00000000233.5
Patient_ID
C3L-00006 -0.056513
C3L-00008 0.549959
C3L-00032 0.088681
C3L-00084 -0.846555
C3L-00090 0.539019
C3L-00098 -0.017370
C3L-00136 0.230347
C3L-00137 0.191915
C3L-00139 -0.410142
C3L-00143 -0.170514

Selecting based on one level¶

We can easily select multiple columns from our multiindex dataframe, based on just the "Name" level of the multiindex:

In [7]:
gene1_filter = prot.columns.get_level_values("Name").str.startswith("A") # Select all columns where the gene starts with "A". This will grab every column where the key "Name" starts with AA
gene1_data = prot.loc[:, gene1_filter]
gene1_data.head()
Out[7]:
Name ARF5 AKAP11 ARHGEF5 APPBP2 AQR ACAP1 ANO8 AP3M2 ASNS ALDH3A2 ... ABCC2 ANKS1A AL034430.2 AP5Z1 ATP8B1 AKR1B15 AC004706.3 ATAD3B ANK2
Database_ID ENSP00000000233.5 ENSP00000025301.2 ENSP00000056217.5 ENSP00000083182.3 ENSP00000156471.5 ENSP00000158762.3 ENSP00000159087.4 ENSP00000174653.3 ENSP00000175506.4 ENSP00000176643.6 ... ENSP00000497274.1 ENSP00000497393.1 ENSP00000497510.1 ENSP00000497815.1 ENSP00000497896.1 ENSP00000498877.1 ENSP00000499350.1 ENSP00000500094.1 ENSP00000500102.1 ENSP00000500937.1
Patient_ID
C3L-00006 -0.056513 -0.385278 0.188877 -0.059319 0.276154 -0.252270 1.280740 0.086567 0.334008 1.048464 ... NaN 0.454359 1.346643 -0.186762 -0.361594 NaN NaN NaN NaN NaN
C3L-00008 0.549959 -0.491451 0.277281 0.225857 0.400321 -0.485365 NaN -0.544367 1.634042 -0.848812 ... NaN 0.438931 0.250021 0.005658 1.065706 -0.310341 0.060549 NaN -0.62873 NaN
C3L-00032 0.088681 0.203899 0.261918 0.192734 -0.244333 0.169655 NaN 0.223638 0.358561 -0.314030 ... NaN NaN 0.411541 0.043151 0.461451 NaN NaN 0.300528 NaN NaN
C3L-00084 -0.846555 -0.286751 -0.468015 0.249142 0.013797 -0.606966 -0.303256 -0.398076 1.017079 -0.385280 ... NaN 1.423702 -0.524652 0.111429 0.172027 NaN NaN 0.102475 NaN NaN
C3L-00090 0.539019 -0.098589 0.605331 0.571185 0.178541 -0.567123 NaN 0.053186 0.390269 1.059128 ... -0.737756 NaN 0.580644 -0.108808 0.429643 -0.218494 NaN 0.314156 NaN NaN

5 rows × 1001 columns

Selecting based on a different level of the multiindex¶

We can also select based on one of the inner levels of the multiindex. For example, to get data for all tyrosine phosphorylation sites:

In [8]:
y_site_filter = prot.columns.get_level_values("Database_ID").str.contains("ENSp") # Create a boolean filter selecting all columns where the Site level contains a "Y"

y_sites = prot.loc[:, y_site_filter] # Select the columns
y_sites.head()
Out[8]:
Patient_ID
C3L-00006
C3L-00008
C3L-00032
C3L-00084
C3L-00090

How to use cptac.utils.reduce_multiindex()¶

To make it easier to work with multi-level indices, we provide the reduce_multiindex function, available for import from the cptac.utils submodule. It can both drop levels from a multiindex, and "flatten" a multi-level index into a single-level index by concatenating the keys from multiple levels into a single key for each column.

In [9]:
import cptac.utils as ut

Dropping Levels¶

We can drop levels based on index or name. We can also drop single or multiple levels at once. Note that it will warn you if duplicate column key combinations arise due to dropping levels.

Dropping by index or name¶

In [10]:
ut.reduce_multiindex(df=prot, levels_to_drop="Database_ID").head()
cptac warning: Due to dropping the specified levels, dataframe now has 1299 duplicated column headers. (C:\Users\sabme\AppData\Local\Temp\ipykernel_21892\2675409348.py, line 1)
Out[10]:
Name ARF5 M6PR ESRRA FKBP4 NDUFAF7 FUCA2 DBNDD1 SEMA3F CFTR CYP51A1 ... SCRIB WIZ BPIFB4 LDB1 WIZ TSGA10 RFX7 SWSAP1 MSANTD2 SVIL
Patient_ID
C3L-00006 -0.056513 0.016557 0.002569 0.389819 0.603610 -0.332543 -0.790426 NaN 0.822732 0.039134 ... 0.161720 -0.884807 NaN 0.268247 0.125392 -0.880833 0.108554 0.107413 -0.085833 NaN
C3L-00008 0.549959 -0.206129 0.905784 -0.303631 0.018767 0.503513 0.950955 0.080142 NaN -0.063213 ... NaN 0.054284 NaN -0.106450 0.380557 -0.756099 0.264611 0.044423 -0.248319 -1.206596
C3L-00032 0.088681 -0.154447 -0.190515 0.170753 0.196356 0.544194 -0.179078 NaN NaN 0.377405 ... -1.086905 0.055991 NaN -0.021986 -0.229645 1.923986 NaN -0.176694 -0.332384 -1.330653
C3L-00084 -0.846555 0.027740 NaN 0.178700 0.264054 -0.183548 0.077215 -0.247164 0.152277 -0.279549 ... -0.125796 0.944212 NaN 0.917409 0.026862 -0.885976 -0.006510 -0.014162 0.365158 NaN
C3L-00090 0.539019 0.956619 -0.039516 0.323656 0.064605 0.173433 -0.524325 -0.038590 -0.311486 0.309905 ... 0.853362 -0.716947 NaN -0.286277 -0.046076 0.089645 -0.444506 -0.072531 -0.463495 NaN

5 rows × 12662 columns

Dropping single or multiple levels at once¶

By passing a list (or array-like) to levels_to drop, we can drop multiple levels of the multiindex at the same time. Note that we must leave at least one existing level.

We will show this with the colon data.

In [11]:
colon = cptac.Coad()
prot = colon.get_proteomics('umich')
prot.head()
Out[11]:
Name ARF5 M6PR ESRRA FKBP4 NDUFAF7 FUCA2 CFTR CYP51A1 USP28 TMEM176A ... TMUB1 CSNK1A1 MICAL2 ANK2 SEPTIN7 ATAD3B ETNK1 MYO6 WIZ HSPA12A
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000003084.6 ENSP00000003100.8 ENSP00000003302.4 ENSP00000004103.3 ... ENSP00000499339.1 ENSP00000499757.1 ENSP00000499778.1 ENSP00000499869.1 ENSP00000499937.1 ENSP00000500094.1 ENSP00000500633.1 ENSP00000500710.1 ENSP00000501300.1 ENSP00000501491.1
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns

In [12]:
# Drop level 'Name'
ut.reduce_multiindex(df=prot, levels_to_drop='Name').head()
#You can also pass a list in order to drop multiple levels
Out[12]:
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000003084.6 ENSP00000003100.8 ENSP00000003302.4 ENSP00000004103.3 ... ENSP00000499339.1 ENSP00000499757.1 ENSP00000499778.1 ENSP00000499869.1 ENSP00000499937.1 ENSP00000500094.1 ENSP00000500633.1 ENSP00000500710.1 ENSP00000501300.1 ENSP00000501491.1
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns

Combining levels (Flattening)¶

We can combine levels of a multiindexed dataframe. When combined the levels will be sepereated by an underscore, by default. We could specify a different seperator using the sep parameter.

In [13]:
ut.reduce_multiindex(df=prot, flatten=True).head()
Out[13]:
Name ARF5_ENSP00000000233.5 M6PR_ENSP00000000412.3 ESRRA_ENSP00000000442.6 FKBP4_ENSP00000001008.4 NDUFAF7_ENSP00000002125.4 FUCA2_ENSP00000002165.5 CFTR_ENSP00000003084.6 CYP51A1_ENSP00000003100.8 USP28_ENSP00000003302.4 TMEM176A_ENSP00000004103.3 ... TMUB1_ENSP00000499339.1 CSNK1A1_ENSP00000499757.1 MICAL2_ENSP00000499778.1 ANK2_ENSP00000499869.1 SEPTIN7_ENSP00000499937.1 ATAD3B_ENSP00000500094.1 ETNK1_ENSP00000500633.1 MYO6_ENSP00000500710.1 WIZ_ENSP00000501300.1 HSPA12A_ENSP00000501491.1
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns

When flatteing levels , NaNs and empty strings will automitically be dropped.

In [14]:
phospho_and_CNV = en.join_omics_to_omics(df1_name="CNV", df2_name="phosphoproteomics", df1_source = 'washu', df2_source = 'umich')
phospho_and_CNV.head()

# Note that the CNV columns all have empty strings in the "Site" level of the columns,
# since the CNV data doesn't have any values for that.
Out[14]:
Name A1BG_washu_CNV A1CF_washu_CNV A2M_washu_CNV A2ML1_washu_CNV A3GALT2_washu_CNV A4GALT_washu_CNV A4GNT_washu_CNV AAAS_washu_CNV AACS_washu_CNV AADAC_washu_CNV ... SCRIB_umich_phosphoproteomics TSGA10_umich_phosphoproteomics SVIL_umich_phosphoproteomics
Site ... S1575T1588S1594 S1594 S11 S173 S213 S391 S779 S101 S296 S459
Peptide ... LAEAPSPAPTPsPTPVEDLGPQTStSPGRLsPDFAEELR LsPDFAEELR sPGRDPELQVEAAEVTTK sPSRLDSFVK RPsPTAR AMDTEsELGR GLDRsLEENLCYR;GLDRsLEENLCYRDF EVVSSQVDDLTsHNEHLCK DSEGDTPsLINWPSSK LPsPTVAR
Patient_ID
C3L-00006 -0.00659 -0.01982 -0.01402 -0.01402 -0.01418 -0.00839 -0.01305 -0.01402 -0.01402 -0.01305 ... NaN 0.667426 NaN 0.905606 NaN -0.069911 -0.584774 NaN -0.561657 -0.652457
C3L-00008 0.02578 0.00726 0.01350 0.01350 0.00732 0.01642 0.01005 0.01225 0.01225 0.01005 ... NaN NaN NaN -0.488427 NaN NaN NaN NaN -0.431599 -1.079638
C3L-00032 0.01262 0.00425 -0.00275 -0.00275 0.00166 0.00549 -0.00038 -0.00275 -0.00275 -0.00038 ... NaN 0.104862 NaN NaN NaN NaN NaN NaN NaN -1.439041
C3L-00084 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.399718 -0.875016 -0.579824 NaN NaN -0.505807 NaN NaN -1.521725
C3L-00090 0.00100 0.41191 -0.02299 -0.02299 -0.02436 -0.01198 -0.03307 -0.02299 -0.02299 -0.03307 ... NaN 1.069439 NaN 0.510268 -1.889144 NaN -0.592203 NaN NaN -1.126482

5 rows × 104580 columns

In [15]:
ut.reduce_multiindex(df=phospho_and_CNV, flatten=True).head()
# Notice that the empty strings have been dropped
Out[15]:
Name A1BG_washu_CNV A1CF_washu_CNV A2M_washu_CNV A2ML1_washu_CNV A3GALT2_washu_CNV A4GALT_washu_CNV A4GNT_washu_CNV AAAS_washu_CNV AACS_washu_CNV AADAC_washu_CNV ... SCRIB_umich_phosphoproteomics_S1575T1588S1594_LAEAPSPAPTPsPTPVEDLGPQTStSPGRLsPDFAEELR SCRIB_umich_phosphoproteomics_S1594_LsPDFAEELR TSGA10_umich_phosphoproteomics_S11_sPGRDPELQVEAAEVTTK TSGA10_umich_phosphoproteomics_S173_sPSRLDSFVK TSGA10_umich_phosphoproteomics_S213_RPsPTAR TSGA10_umich_phosphoproteomics_S391_AMDTEsELGR TSGA10_umich_phosphoproteomics_S779_GLDRsLEENLCYR;GLDRsLEENLCYRDF TSGA10_umich_phosphoproteomics_S101_EVVSSQVDDLTsHNEHLCK SVIL_umich_phosphoproteomics_S296_DSEGDTPsLINWPSSK SVIL_umich_phosphoproteomics_S459_LPsPTVAR
Patient_ID
C3L-00006 -0.00659 -0.01982 -0.01402 -0.01402 -0.01418 -0.00839 -0.01305 -0.01402 -0.01402 -0.01305 ... NaN 0.667426 NaN 0.905606 NaN -0.069911 -0.584774 NaN -0.561657 -0.652457
C3L-00008 0.02578 0.00726 0.01350 0.01350 0.00732 0.01642 0.01005 0.01225 0.01225 0.01005 ... NaN NaN NaN -0.488427 NaN NaN NaN NaN -0.431599 -1.079638
C3L-00032 0.01262 0.00425 -0.00275 -0.00275 0.00166 0.00549 -0.00038 -0.00275 -0.00275 -0.00038 ... NaN 0.104862 NaN NaN NaN NaN NaN NaN NaN -1.439041
C3L-00084 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.399718 -0.875016 -0.579824 NaN NaN -0.505807 NaN NaN -1.521725
C3L-00090 0.00100 0.41191 -0.02299 -0.02299 -0.02436 -0.01198 -0.03307 -0.02299 -0.02299 -0.03307 ... NaN 1.069439 NaN 0.510268 -1.889144 NaN -0.592203 NaN NaN -1.126482

5 rows × 104580 columns

Getting a single level index of tuples¶

You can also use reduce_multiindex to turn the multi-level column index into a single level index of tuples, with each value in a column's tuple corresponding to the column's value for that level of the index:

In [16]:
ut.reduce_multiindex(df=prot, tuples=True).head()
Out[16]:
(ARF5, ENSP00000000233.5) (M6PR, ENSP00000000412.3) (ESRRA, ENSP00000000442.6) (FKBP4, ENSP00000001008.4) (NDUFAF7, ENSP00000002125.4) (FUCA2, ENSP00000002165.5) (CFTR, ENSP00000003084.6) (CYP51A1, ENSP00000003100.8) (USP28, ENSP00000003302.4) (TMEM176A, ENSP00000004103.3) ... (TMUB1, ENSP00000499339.1) (CSNK1A1, ENSP00000499757.1) (MICAL2, ENSP00000499778.1) (ANK2, ENSP00000499869.1) (SEPTIN7, ENSP00000499937.1) (ATAD3B, ENSP00000500094.1) (ETNK1, ENSP00000500633.1) (MYO6, ENSP00000500710.1) (WIZ, ENSP00000501300.1) (HSPA12A, ENSP00000501491.1)
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns

Turning off warnings¶

If your multiindex operation creates duplicate column headers, or has no effect, reduce_multiindex will warn you. You can silence these warnings by passing True to the quiet parameter:

In [17]:
ut.reduce_multiindex(df=prot, levels_to_drop="Name").head()
Out[17]:
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000003084.6 ENSP00000003100.8 ENSP00000003302.4 ENSP00000004103.3 ... ENSP00000499339.1 ENSP00000499757.1 ENSP00000499778.1 ENSP00000499869.1 ENSP00000499937.1 ENSP00000500094.1 ENSP00000500633.1 ENSP00000500710.1 ENSP00000501300.1 ENSP00000501491.1
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns

In [18]:
# No warning will be issued
ut.reduce_multiindex(df=prot, levels_to_drop="Name", quiet=True).head()
Out[18]:
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6 ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5 ENSP00000003084.6 ENSP00000003100.8 ENSP00000003302.4 ENSP00000004103.3 ... ENSP00000499339.1 ENSP00000499757.1 ENSP00000499778.1 ENSP00000499869.1 ENSP00000499937.1 ENSP00000500094.1 ENSP00000500633.1 ENSP00000500710.1 ENSP00000501300.1 ENSP00000501491.1
Patient_ID
01CO005 -0.203037 -0.223341 -0.283633 -0.612614 0.514855 -0.824026 NaN 0.045383 NaN -0.248511 ... NaN NaN -0.042548 NaN NaN 0.925011 -0.173468 -0.180521 0.139707 -0.882283
01CO006 0.188931 0.544620 NaN -0.571640 -0.209734 0.799090 NaN -0.338493 -0.042567 NaN ... -0.411664 -0.454109 -0.725892 NaN NaN -0.707588 -0.846624 0.329813 -0.311147 -0.446358
01CO008 0.404810 -0.246523 -0.053940 0.252995 0.190861 0.101419 -0.502876 0.627060 0.089815 -0.106411 ... 0.192279 -0.558236 -0.093708 -1.874293 -0.248307 -0.899186 -0.526260 0.668713 0.109366 -1.125296
01CO013 -0.276982 -0.017659 NaN -0.455055 0.500686 -0.350366 NaN 0.263168 0.683830 NaN ... 0.220231 NaN 0.241860 -3.939263 NaN 0.514931 -0.078267 0.122032 0.130764 -1.146911
01CO014 -0.160155 0.100022 0.259696 0.341345 -0.310265 0.095461 -0.745855 1.006614 NaN NaN ... -0.198671 0.226146 0.036229 NaN NaN 1.189468 0.117736 0.586529 -0.006767 -1.106068

5 rows × 9457 columns