How do I select only the index columns in a pandas MultiIndex DataFrame?

Time: 2017-12-14 06:48:10

Tags: python-3.x pandas dataframe multi-index

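I have a DataFrame with a two-column index. I am trying to filter rows from it and keep only the index columns of the original DataFrame in the new, filtered DataFrame.

I created the DataFrame from a CSV file as follows (the CSV file can be found here):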

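import pandas as pd
census_df = pd.read_csv("census.csv", index_col=["STNAME", "CTYNAME"])
# sort_index returns a new DataFrame, so the result has to be assigned back
census_df = census_df.sort_index(ascending=True)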

I then applied some filters to the DataFrame. They work as expected and return the rows I want. The code I used is shown below:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return pd.DataFrame(new_df.iloc[:, -1])

my_answer()

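The problem:

The code above returns a DataFrame that has one data column in addition to the two index columns. All I want is the two index columns. So the final answer should return a DataFrame containing "STNAME" and "CTYNAME", with 5 rows.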

2 Answers:

Answer 0: (score: 0)

You can convert the index to a DataFrame:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return pd.DataFrame(new_df.index.tolist(), columns=['STNAME','CTYNAME'])

print (my_answer())

         STNAME            CTYNAME
0          Iowa  Washington County
1     Minnesota  Washington County
2  Pennsylvania  Washington County
3  Rhode Island  Washington County
4     Wisconsin  Washington County
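
As a side note, the same result can usually be reached without hard-coding the column names. The sketch below is only an illustration: `toy_df` and its values are made up to stand in for census.csv, and the two options shown (`MultiIndex.to_frame(index=False)` and selecting no columns followed by `reset_index()`) are alternatives to the `tolist()` approach above, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for census.csv: a tiny frame with the same index names.
toy_df = pd.DataFrame(
    {
        "STNAME": ["Iowa", "Iowa", "Minnesota"],
        "CTYNAME": ["Washington County", "Adams County", "Washington County"],
        "POPESTIMATE2015": [22000, 3700, 250000],
    }
).set_index(["STNAME", "CTYNAME"])

# Keep only the rows whose county name starts with "Washington".
mask = toy_df.index.get_level_values("CTYNAME").str.startswith("Washington")
filtered = toy_df[mask]

# Option A: turn the MultiIndex itself into a DataFrame with a fresh 0..n-1 index.
index_only_a = filtered.index.to_frame(index=False)

# Option B: select no data columns, then move the index levels into columns.
index_only_b = filtered[[]].reset_index()

print(index_only_a)
```

Both options take the column names from the index level names, so they keep working if the index levels are renamed.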

If you want the output as a MultiIndex, use MultiIndex.remove_unused_levels, which requires pandas 0.20.0 or later:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return new_df.index.remove_unused_levels()

print (my_answer())

MultiIndex(levels=[['Iowa', 'Minnesota', 'Pennsylvania', 'Rhode Island', 'Wisconsin'], 
                   ['Washington County']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['STNAME', 'CTYNAME'])
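
To see why `remove_unused_levels` matters here, the following sketch compares the index levels before and after the call. The data is made up for illustration (`toy_df` is a hypothetical stand-in for the census frame); the point is that boolean filtering keeps the original level values around even when no remaining row uses them.

```python
import pandas as pd

# Hypothetical three-row frame with the same index names as the census data.
idx = pd.MultiIndex.from_tuples(
    [
        ("Iowa", "Adams County"),
        ("Iowa", "Washington County"),
        ("Minnesota", "Washington County"),
    ],
    names=["STNAME", "CTYNAME"],
)
toy_df = pd.DataFrame({"REGION": [2, 2, 2]}, index=idx)

# Filter down to the "Washington" counties only.
mask = toy_df.index.get_level_values("CTYNAME").str.startswith("Washington")
filtered = toy_df[mask]

# After filtering, the CTYNAME level still lists "Adams County".
stale_levels = list(filtered.index.levels[1])

# remove_unused_levels returns a new MultiIndex whose levels only
# contain values that actually occur in the remaining rows.
trimmed = filtered.index.remove_unused_levels()
fresh_levels = list(trimmed.levels[1])

print(stale_levels)
print(fresh_levels)
```

The rows themselves are unchanged by the call; only the stored level values are trimmed.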

Answer 1: (score: 0)

Use a list comprehension:

def my_answer():
     mask1 = census_df["REGION"].between(1, 2)
     mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
     mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
     new_df = census_df[mask1 & mask2 & mask3]

     return pd.DataFrame([new_df.index[x] for x in range(len(new_df))])    

my_answer()

Output:

    0              1
 0  Iowa         Washington County
 1  Minnesota    Washington County
 2  Pennsylvania Washington County
 3  Rhode Island Washington County
 4  Wisconsin    Washington County
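
Note that the columns in this output are labelled 0 and 1 rather than STNAME and CTYNAME. One possible way to keep the names, sketched here with a made-up index in place of `new_df.index`, is to pass the index level names as `columns`:

```python
import pandas as pd

# Hypothetical filtered index standing in for new_df.index.
filtered_index = pd.MultiIndex.from_tuples(
    [("Iowa", "Washington County"), ("Minnesota", "Washington County")],
    names=["STNAME", "CTYNAME"],
)

# Iterating a MultiIndex yields one tuple per row, so list() replaces the
# explicit comprehension; the level names become the column labels.
named_df = pd.DataFrame(list(filtered_index), columns=list(filtered_index.names))

print(named_df)
```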