Pandas: how to get the rows of a MultiIndex DataFrame that have multiple values at one index level

Time: 2016-04-11 02:09:14

Tags: python pandas

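I got this DataFrame from a pivot operation, and I don't know how to work with "nested" (MultiIndex) DataFrames in pandas.

The DataFrame looks like the example below, only with far more rows than shown here. [Edit: added an extra "chr18" row to make the example more illustrative. It needs to be filtered out as well.]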

mmc chrom   start     stop      experiment  isdone  strand  countL countR
3   chr18   2044696   2044716   hj-10_b_10  FALSE   -        12     12
            2060000   2061000   hj-10_b_10  FALSE   -        162    162
    chr3    95359191  95359212  hj-10_b_10  FALSE   -        2497   2497
                                hj-9_b_9    TRUE    -        3476   3477
                                hj1_100_3   TRUE    -        2351   2351
4   chr19   598940    598961    hj-10_b_10  FALSE   -        494    494
                                hj1_100_3*1 TRUE    -        211    211

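From this DataFrame I want to keep every "chrom" entry that has more than one entry at the experiment level, i.e. select every combination of chrom, start and stop that has multiple entries at the experiment index level.

This is the DataFrame I want as a result (note that it has no mmc 3 / chr18 entries, because each of those two rows has only a single experiment, "hj-10_b_10", and is therefore not repeated).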

mmc chrom   start     stop      experiment  isdone  strand  countL countR
 3  chr3    95359191  95359212  hj-10_b_10  FALSE   -        2497   2497
                                hj-9_b_9    TRUE    -        3476   3477
                                hj1_100_3   TRUE    -        2351   2351
 4  chr19   598940    598961    hj-10_b_10  FALSE   -        494    494
                                hj1_100_3*1 TRUE    -        211    211

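I could do this outside pandas, but I would like to learn the pandas way.

How can I select, from a very large DataFrame, all entries whose count at a particular index level exceeds a given number?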

Update

You can create the MultiIndex DataFrame with this code:

import pandas
from pandas import DataFrame

# One list per index level; position i across the lists describes row i.
index_tuples_mmc = [3, 3, 3, 3, 3, 4, 4]
index_tuples_chrom = ["chr18", "chr18", "chr3", "chr3", "chr3", "chr19", "chr19"]
index_tuples_start = ["2044696", "2060000", "95359191", "95359191", "95359191", "598940", "598940"]
index_tuples_stop = ["2044716", "2061000", "95359212", "95359212", "95359212", "598961", "598961"]
index_tuples_experiment = ["hj-10_b_10", "hj-10_b_10", "hj-10_b_10", "hj-9_b_9", "hj1_100_3", "hj-10_b_10", "hj1_100_3*1"]
index_tuples_idone = ["FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"]
index_tuples_strand = ["-", "-", "-", "-", "-", "-", "-"]
arrays = [index_tuples_mmc, index_tuples_chrom, index_tuples_start,
          index_tuples_stop, index_tuples_experiment, index_tuples_idone,
          index_tuples_strand]
tuples = list(zip(*arrays))
index = pandas.MultiIndex.from_tuples(
    tuples,
    names=["mmc", "chrom", "start", "stop", "experiment", "isdone", "strand"])
df2 = DataFrame([12, 162, 2497, 3476, 2351, 494, 211], index=index, columns=["countL"])
df2["countR"] = df2["countL"]

1 answer:

Answer 0 (score: 0)

You can try this:

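import pandas as pd

idx = pd.IndexSlice
# df2 is the DataFrame built in the question; keep rows whose chrom level is chr3 or chr19
df2.loc[idx[:, ['chr3', 'chr19'], :, :, :, :, :], :]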
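
The slice above names the chromosomes by hand, which only works once you already know which ones qualify. A minimal sketch of one way to derive the selection from the data instead is to group on the region levels and keep the groups with more than one row. This is not the answerer's method; the names `region_levels`, `group_size` and `result` are made up for illustration, and the sketch assumes that one row corresponds to one experiment entry, as in the example frame.

```python
import pandas as pd

# Rebuild the example frame from the question (same values, compact form).
index = pd.MultiIndex.from_arrays(
    [
        [3, 3, 3, 3, 3, 4, 4],
        ["chr18", "chr18", "chr3", "chr3", "chr3", "chr19", "chr19"],
        ["2044696", "2060000", "95359191", "95359191", "95359191", "598940", "598940"],
        ["2044716", "2061000", "95359212", "95359212", "95359212", "598961", "598961"],
        ["hj-10_b_10", "hj-10_b_10", "hj-10_b_10", "hj-9_b_9", "hj1_100_3",
         "hj-10_b_10", "hj1_100_3*1"],
        ["FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"],
        ["-"] * 7,
    ],
    names=["mmc", "chrom", "start", "stop", "experiment", "isdone", "strand"],
)
df2 = pd.DataFrame({"countL": [12, 162, 2497, 3476, 2351, 494, 211]}, index=index)
df2["countR"] = df2["countL"]

# Hypothetical name: the index levels that together identify one region.
region_levels = ["mmc", "chrom", "start", "stop"]

# For every row, the number of rows sharing its region, aligned to df2's index.
group_size = df2.groupby(level=region_levels)["countL"].transform("size")

# Keep only regions that appear with more than one experiment entry.
result = df2[group_size > 1]
print(result)
```

On the example data this drops both chr18 rows, since each of them is the only row for its start/stop pair. If a region could list the same experiment more than once, counting distinct experiment values per group would be the safer test than counting rows.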

To learn more about MultiIndex / advanced indexing, see http://pandas.pydata.org/pandas-docs/stable/advanced.html
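
As a small companion to the slicing approach, the same kind of selection can also be written as a boolean mask over one index level. The sketch below is only an illustration: it uses a trimmed-down four-level version of the question's index (an assumption made to keep the example short), and `mask` and `selected` are hypothetical names.

```python
import pandas as pd

# Trimmed-down version of the question's frame: only four of the seven levels.
index = pd.MultiIndex.from_arrays(
    [
        [3, 3, 3, 3, 3, 4, 4],
        ["chr18", "chr18", "chr3", "chr3", "chr3", "chr19", "chr19"],
        ["2044696", "2060000", "95359191", "95359191", "95359191", "598940", "598940"],
        ["hj-10_b_10", "hj-10_b_10", "hj-10_b_10", "hj-9_b_9", "hj1_100_3",
         "hj-10_b_10", "hj1_100_3*1"],
    ],
    names=["mmc", "chrom", "start", "experiment"],
)
df = pd.DataFrame({"countL": [12, 162, 2497, 3476, 2351, 494, 211]}, index=index)

# Boolean mask: True where the chrom level is one of the wanted values.
mask = df.index.get_level_values("chrom").isin(["chr3", "chr19"])

# Boolean indexing keeps the original row order.
selected = df[mask]
print(selected)
```

A mask like this refers to the level by name rather than by position, so it keeps working if levels are later added or reordered.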