Question

mukey   cokey     hzdept_r  hzdepb_r
422927  11090397    0        20
422927  11090397    20       71
422927  11090397    71       152
422927  11090398    0        18
422927  11090398    18       117
422927  11090398    117      152

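I want to subset the DataFrame above so that only the rows belonging to the first cokey group are selected (11090397 in this case). Since this is only a sample dataset, the solution needs to scale to much larger versions of this DataFrame.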

In this case, the resulting dataset should be:

mukey   cokey     hzdept_r  hzdepb_r
422927  11090397    0        20
422927  11090397    20       71
422927  11090397    71       152

I have tried using groupby, but I don't know how to select only the first cokey value from there.

Answer 1

If you are looking for all rows whose cokey equals the first cokey in the df, use:

test[test['cokey'] == test.cokey[0]]

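Edit: @dsm is right. The code above gives you the cokey at index label 0, so if your df does not have an auto-incrementing index starting at zero, you may not get the result you actually want. Use this instead: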

test[test['cokey'] == test.iloc[0]['cokey']]
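To illustrate the difference between the two lookups, here is a minimal sketch. It rebuilds the sample data from the question and gives it an index starting at 10, which is purely an assumption for demonstration. With such an index, the label-based `test.cokey[0]` fails, while the positional `.iloc[0]` still finds the first row.

```python
import pandas as pd

# Sample data from the question, with a hypothetical index that does not start at 0
test = pd.DataFrame(
    {
        "mukey": [422927] * 6,
        "cokey": [11090397] * 3 + [11090398] * 3,
        "hzdept_r": [0, 20, 71, 0, 18, 117],
        "hzdepb_r": [20, 71, 152, 18, 117, 152],
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Label-based lookup: there is no row labelled 0, so this raises KeyError
try:
    test.cokey[0]
    label_lookup_ok = True
except KeyError:
    label_lookup_ok = False

# Position-based lookup: always refers to the first row, whatever its label
first_cokey = test["cokey"].iloc[0]
subset = test[test["cokey"] == first_cokey]
print(subset)
```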

Answer 2

If df is the sample DataFrame:

cokeys = df.cokey.unique()  # unique keys, in order of first appearance (a set has no guaranteed order)
for k in cokeys:
    print(df[df.cokey == k])  # one sub-DataFrame per key

Result:

    mukey     cokey  hzdept_r  hzdepb_r
0  422927  11090397         0        20
1  422927  11090397        20        71
2  422927  11090397        71       152
    mukey     cokey  hzdept_r  hzdepb_r
3  422927  11090398         0        18
4  422927  11090398        18       117
5  422927  11090398       117       152

If you only want the first DataFrame, set k = df.iloc[0].cokey.
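Since the question mentions groupby, the same per-key split can also be sketched with it. This is only one possible approach, not something the answer above prescribes. Passing `sort=False` is a choice made here so that groups come out in order of first appearance rather than sorted by key.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "mukey": [422927] * 6,
        "cokey": [11090397] * 3 + [11090398] * 3,
        "hzdept_r": [0, 20, 71, 0, 18, 117],
        "hzdepb_r": [20, 71, 152, 18, 117, 152],
    }
)

# Iterating a groupby yields (key, sub-DataFrame) pairs;
# sort=False keeps the groups in order of first appearance
groups = df.groupby("cokey", sort=False)

# Take only the first pair instead of looping over every group
first_key, first_group = next(iter(groups))
print(first_key)
print(first_group)
```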

Answer 3

Another approach is to take the first unique value:

In [97]:

df[df['cokey'] == df['cokey'].unique()[0]]
Out[97]:
    mukey     cokey  hzdept_r  hzdepb_r
0  422927  11090397         0        20
1  422927  11090397        20        71
2  422927  11090397        71       152

You can also use integer-based indexing to get the first value to filter on:

In [99]:

df[df['cokey'] == df['cokey'].iloc[0]]
Out[99]:
    mukey     cokey  hzdept_r  hzdepb_r
0  422927  11090397         0        20
1  422927  11090397        20        71
2  422927  11090397        71       152
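For reuse on larger DataFrames, the positional variant could be wrapped in a small function. The sketch below is illustrative only: the name `first_group` and the empty-DataFrame guard are assumptions added here, not part of the answer.

```python
import pandas as pd


def first_group(df, column):
    """Return the rows whose `column` equals its value in the first row.

    Hypothetical helper for illustration; the name and the
    empty-DataFrame guard are assumptions.
    """
    if df.empty:
        # .iloc[0] would raise IndexError on an empty frame
        return df.copy()
    return df[df[column] == df[column].iloc[0]]


# Sample data from the question
df = pd.DataFrame(
    {
        "mukey": [422927] * 6,
        "cokey": [11090397] * 3 + [11090398] * 3,
        "hzdept_r": [0, 20, 71, 0, 18, 117],
        "hzdepb_r": [20, 71, 152, 18, 117, 152],
    }
)

result = first_group(df, "cokey")
print(result)
```

Because the comparison builds a boolean mask over the whole column, this returns every row that shares the first row's cokey, even if those rows are not contiguous.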
