Get particular row as series from pandas dataframe

Question

How do we get a particular filtered row as a Series?

Example DataFrame:

>>> df = pd.DataFrame({'date': [20130101, 20130101, 20130102], 'location': ['a', 'a', 'c']})
>>> df
       date location
0  20130101        a
1  20130101        a
2  20130102        c

I need to select the row where location is c as a Series.

I tried:

row = df[df["location"] == "c"].head(1)  # gives a dataframe
row = df.ix[df["location"] == "c"]       # also gives a dataframe with single row

In either case, I can't get the row as a Series.

Answer 1

Use the squeeze function, which removes one dimension from the DataFrame:

df[df["location"] == "c"].squeeze()
Out[5]: 
date        20130102
location           c
Name: 2, dtype: object

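The DataFrame.squeeze method behaves the same way as the squeeze argument of the read_csv function when it is set to True (in pandas versions that still provide that argument): if the resulting dataframe is a 1-len dataframe, i.e. it has only one dimension (a single column or a single row), then the object is squeezed down to the smaller-dimension object.

In your case, you get a Series object from the DataFrame. The same logic applies if you squeeze a Panel down to a DataFrame.

squeeze is explicit in your code and clearly shows your intent to "cast down" the object in hand, because its dimension can be projected to a smaller one.

If the dataframe has more than one column and more than one row, squeeze has no effect.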
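
To make the dimension rule concrete, here is a minimal sketch using the df from the question (the variable names are only illustrative). It shows what squeeze gives back for a one-row, a one-column, a 1x1 and a 2x2 selection.

```python
import pandas as pd

df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                   'location': ['a', 'a', 'c']})
mask = df['location'] == 'c'

one_row = df[mask].squeeze()                      # 1 row x 2 columns -> Series
one_col = df[['date']].squeeze()                  # 3 rows x 1 column -> Series
one_cell = df.loc[mask, ['date']].squeeze()       # 1 x 1 -> scalar
row_only = df.loc[mask, ['date']].squeeze(axis=0) # 1 x 1, rows only -> Series
unchanged = df[df['location'] == 'a'].squeeze()   # 2 x 2 -> still a DataFrame

for name, obj in [('one_row', one_row), ('one_col', one_col),
                  ('one_cell', one_cell), ('row_only', row_only),
                  ('unchanged', unchanged)]:
    print(name, type(obj).__name__)
```

Passing axis=0 restricts the squeeze to the row axis, which may be preferable when you always want a Series back for a single matching row, even if only one column was selected.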

Answer 2

You can take the first row with integer indexing (the iloc indexer):

>>> df[df["location"] == "c"].iloc[0]
date        20130102
location           c
Name: 2, dtype: object
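
One thing to keep in mind with this approach, shown here as a small sketch: when no row matches, the filtered frame is empty and .iloc[0] raises an IndexError instead of returning something like None. The value 'z' below is made up purely to produce an empty result.

```python
import pandas as pd

df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                   'location': ['a', 'a', 'c']})

# Normal case: the first matching row comes back as a Series.
row = df[df['location'] == 'c'].iloc[0]

# No match: the filtered frame is empty, so position 0 is out of bounds.
try:
    df[df['location'] == 'z'].iloc[0]   # 'z' is a hypothetical value
    raised = False
except IndexError:
    raised = True

print(raised)
```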

Answer 3

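How to get a particular row as a Series from a pandas DataFrame?

Reliable solution: `DataFrame.iloc` with `Series.idxmax`

As a better alternative, if you can guarantee that at least one row satisfies the condition, use Series.idxmax() on the mask and a single DataFrame.iloc call (this relies on the default integer index, where labels and positions coincide).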

df.iloc[(df['location'] == 'c').idxmax()]

date        20130102
location           c
Name: 2, dtype: object

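Arguably, this is a better alternative to the currently posted answers, because it is guaranteed to return one row (and only one row) and never goes through an intermediate copy.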
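
When the index is not the default one, idxmax and iloc no longer line up: idxmax returns an index label, while iloc expects an integer position. As a sketch under that assumption (the string index below is hypothetical), either pair idxmax with loc, or take the positional argmax of the mask's NumPy array and keep iloc:

```python
import pandas as pd

df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                   'location': ['a', 'a', 'c']},
                  index=['x', 'y', 'z'])   # hypothetical non-default index

mask = df['location'] == 'c'

# Label-based: idxmax gives the label of the first True, loc looks it up.
by_label = df.loc[mask.idxmax()]

# Position-based: argmax on the boolean array gives the first True position.
by_position = df.iloc[mask.to_numpy().argmax()]

print(by_label.equals(by_position))
```

Both variants still need the "at least one row matches" guarantee discussed in this answer.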

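Critique of the other answers

The accepted answer mentions only in passing that squeeze does nothing if more than one row is returned, but that is exactly the problem: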

df

       date location
0  20130101        a
1  20130101        a
2  20130102        c

df[df["location"] == "c"].squeeze()   # Works as expected.

date        20130102
location           c
Name: 2, dtype: object

Now consider what happens when more than one row satisfies the condition.

df2 = pd.concat([df] * 2, ignore_index=True)
df2

       date location
0  20130101        a
1  20130101        a
2  20130102        c
3  20130101        a
4  20130101        a
5  20130102        c

df2[df2["location"] == "c"].squeeze() # No effect.

       date location
2  20130102        c
5  20130102        c

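With idxmax, you always get the index of the first row holding the highest value in the result of df["location"] == "c" (which is True if at least one row satisfies the condition). This way you get a Series every time.

Next, @RomanPekar's answer uses iloc on the result of a boolean indexing call, and boolean indexing returns a copy. This becomes a problem if you try to assign a new row: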

df[df["location"] == "c"].iloc[0] = pd.Series({'location': 'd', 'date': np.nan})
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_indexer,col_indexer] = value instead

You get a SettingWithCopyWarning, and df itself is left unchanged.

This is not a problem if you use a single iloc call:

df.iloc[(df['location'] == 'c').idxmax()] = (
    pd.Series({'location': 'd', 'date': np.nan}))
df

         date location
0  20130101.0        a
1  20130101.0        a
2         NaN        d
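
The difference between the two assignments can be checked directly. The sketch below assumes the df from the question and, to keep it short, writes a single cell rather than a whole row. The warning filter is there only so the script runs quietly on pandas versions that emit SettingWithCopyWarning.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                   'location': ['a', 'a', 'c']})
mask = df['location'] == 'c'
col = df.columns.get_loc('location')   # integer position of the column

# Two-step version: the write lands on the filtered copy, not on df.
filtered = df[mask]
with warnings.catch_warnings():
    warnings.simplefilter('ignore')    # hide SettingWithCopyWarning, if emitted
    filtered.iloc[0, col] = 'd'
after_chained = df.loc[2, 'location']  # value in df after the two-step write
print(after_chained)

# Single iloc call on df itself: the write reaches df.
# mask.idxmax() is a valid position here because df has the default RangeIndex.
df.iloc[mask.idxmax(), col] = 'd'
print(df.loc[2, 'location'])
```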

Caveat Emptor

idxmax returns the index of the first True row in the result of df['location'] == 'c':

df2.iloc[(df2['location'] == 'c').idxmax()]

date        20130102
location           c
Name: 2, dtype: object

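However, the caveat shows up when no rows match at all. idxmax will still return the index of the first row (because the first row's value is False, which is the maximum value in an all-False mask).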

df3 = df.query('location == "a"')
df3

       date location
0  20130101        a
1  20130101        a

# This will produce an incorrect result.
df3.iloc[(df3['location'] == 'c').idxmax()]  

date        20130101
location           a
Name: 0, dtype: object

So you may want to add some error handling for these corner cases. My suggestion is a concise inline if-else expression:

df3.iloc[mask.idxmax()] if mask.any() else None

Some examples:

# Correct handling of corner case.
m = df3['location'] == 'c'
ser = df3.iloc[m.idxmax()] if m.any() else None
print(ser)
# None

# Correct handling of the standard case.
m = df3['location'] == 'a'
df3.iloc[m.idxmax()] if m.any() else None

date        20130101
location           a
Name: 0, dtype: object
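
If the inline expression is needed in several places, it can be wrapped in a small helper. This is only a sketch: first_match is a hypothetical name, not a pandas function, and it uses the positional argmax of the mask so that it does not depend on the index being the default RangeIndex.

```python
from typing import Optional

import pandas as pd


def first_match(frame: pd.DataFrame, mask: pd.Series) -> Optional[pd.Series]:
    """Return the first row where mask is True as a Series, or None if none match."""
    if not mask.any():
        return None
    # argmax on the boolean array is the position of the first True value.
    return frame.iloc[mask.to_numpy().argmax()]


df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                   'location': ['a', 'a', 'c']})
df3 = df.query('location == "a"')

no_row = first_match(df3, df3['location'] == 'c')   # corner case: no match
a_row = first_match(df3, df3['location'] == 'a')    # standard case
print(no_row)
print(a_row)
```

Returning None keeps the behaviour of the inline version; raising a KeyError or ValueError instead would be an equally reasonable design if a missing row should never pass silently.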
