Question

My data frame looks like this:

NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306

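There is always a non-NaN value in column F, and in at least one of the other columns A-E.

I want to create a sub-table containing only those rows that have certain combinations of non-NaN values in their columns. There are many of these desired combinations, including doublets and triplets. Here are three examples of combinations I want to extract:

Rows which contain non-NaN values in columns A & B
Rows which contain non-NaN values in C & D
Rows which contain non-NaN values in A & B & C

I already know about the np.isfinite and pd.notnull commands from this question, but I do not know how to apply them to combinations of columns.

Also, once I have a list of commands for dropping rows that do not match one of my desired combinations, I do not know how to tell Pandas to drop a row only if it matches none of the desired combinations.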

Answer 1

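Often we need to perform logical operations on boolean arrays (either numpy arrays or pandas Series) as part of selecting a subset of a data frame. Using the 'and', 'or' and 'not' operators for this does not work.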

In [79]: df[pd.notnull(df['A']) and pd.notnull(df['F'])]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

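In Python, when the 'and', 'or' and 'not' operators are applied to non-boolean variables, those variables are usually considered True unless they represent "empty" objects such as [], int(0), float(0), None, and so on. It would therefore be confusing to use these same operators for array-wise boolean operations in Pandas; some people would expect them to simply evaluate to True.

Instead, we should use &, | and ~.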

In [69]: df[pd.notnull(df['A']) & pd.notnull(df['F'])]
Out[69]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306
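
The same pattern covers "or" and "not". The following is a minimal sketch, assuming df is rebuilt from the sample table in the question; the variable names are made up for illustration. Note that each comparison has to be wrapped in parentheses, because & and | bind more tightly than > or ==.

```python
from io import StringIO

import pandas as pd

data = """NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

# Rebuild the sample frame from the question (whitespace-separated)
df = pd.read_csv(StringIO(data), sep=r"\s+")

# "or": rows where A or B holds a value
a_or_b = df[df["A"].notnull() | df["B"].notnull()]

# "not": rows where A is NaN
a_missing = df[~df["A"].notnull()]

# Mixing a notnull test with a comparison: parenthesize the comparison
a_and_positive_f = df[df["A"].notnull() & (df["F"] > 0)]

print(a_or_b)
print(a_missing)
print(a_and_positive_f)
```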

Another, shorter but less flexible, way is to use any(), all() or empty.

In [78]: df[pd.notnull(df[['A', 'F']]).all(axis=1)]
Out[78]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306
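
To make the difference between all(), any() and empty concrete, here is a small sketch on the same sample data. It assumes df is rebuilt from the table in the question, and the names cols and doublets are illustrative only. Counting the non-NaN cells per row is one possible way to pick out "doublets" in general, without naming the columns.

```python
from io import StringIO

import pandas as pd

data = """NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

df = pd.read_csv(StringIO(data), sep=r"\s+")
cols = ["A", "B", "C", "D", "E"]
present = df[cols].notnull()

# all(axis=1): every listed column must hold a value in that row
all_present = df[present.all(axis=1)]

# any(axis=1): at least one listed column must hold a value
any_present = df[present.any(axis=1)]

# .empty reports whether a selection returned no rows at all
print(all_present.empty)

# Rows with exactly two non-NaN values among A-E
doublets = df[present.sum(axis=1) == 2]
print(doublets)
```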

You can read more about this here.

Answer 2

You can use apply with a lambda function that picks out the non-NaN values. You can check whether a value is NaN with numpy.isnan(..).

data="""NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

import numpy as np
import pandas as pd
from io import StringIO

# Parse the whitespace-separated sample table; "NaN" is read as a missing value
df = pd.read_csv(StringIO(data), sep=r"\s+")
print(df)

# Value from A or B, whichever is non-NaN (0 if both are present, NaN if neither is)
df["A_B"] = df.apply(lambda x: x['A'] if np.isnan(x['B']) else x['B'] if np.isnan(x['A']) else 0, axis=1)

# Same rule for C and D
df["C_D"] = df.apply(lambda x: x['C'] if np.isnan(x['D']) else x['D'] if np.isnan(x['C']) else 0, axis=1)

# Same rule applied to the combined A_B column and C
df["A_B_C"] = df.apply(lambda x: x['C'] if np.isnan(x['A_B']) else x['A_B'] if np.isnan(x['C']) else 0, axis=1)
print(df)

# Same rule applied to the combined A_B and C_D columns
df["A_B_C_D"] = df.apply(lambda x: x['A_B'] if np.isnan(x['C_D']) else x['C_D'] if np.isnan(x['A_B']) else 0, axis=1)
print(df)

Output:

  NUM      A      B   C      D      E         F    A_B    C_D  A_B_C
0  p1    NaN -1.183 NaN    NaN    NaN  1.829711 -1.183    NaN -1.183
1  p5    NaN    NaN NaN    NaN  1.267 -1.552721    NaN    NaN    NaN
2  p9  1.138    NaN NaN -1.179    NaN  1.227306  1.138 -1.179  1.138

If you do not need to go through the conditional cases, you can look at the other approaches explained in another post.
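
As one possible alternative to the row-wise apply, the same conditional can be sketched with np.where, which evaluates whole columns at once. This is an illustration only: the helper name either_value is hypothetical, and df is assumed to be rebuilt from the sample table in the question.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

df = pd.read_csv(StringIO(data), sep=r"\s+")


def either_value(left, right):
    """Hypothetical helper mirroring the lambda used with apply:
    left if right is NaN, else right if left is NaN, else 0."""
    values = np.where(right.isna(), left, np.where(left.isna(), right, 0.0))
    return pd.Series(values, index=left.index)


df["A_B"] = either_value(df["A"], df["B"])
df["C_D"] = either_value(df["C"], df["D"])
df["A_B_C"] = either_value(df["C"], df["A_B"])
print(df)
```

On a frame this small the difference is negligible; the column-wise form mainly helps when there are many rows.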

Answer 3

Let's assume your data frame is called df. You can use a boolean mask like this.

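# Specify the column combination that you want to pull
combo1 = ['A', 'B']

# Select rows in the data frame that have non-NaN values in every column
# of the combination specified above. all(axis=1) collapses the boolean
# frame to one True/False per row, which is what .loc needs as a row mask.
notmissing = df.loc[:, combo1].notnull().all(axis=1)
df = df.loc[notmissing, :]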
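
To keep a row when it matches any one of several combinations, one possible sketch is to build one row mask per combination and then combine the masks with any(axis=1). It assumes df is rebuilt from the sample table in the question. The list combos is illustrative: ['A', 'D'] is added only because no row in the sample data has values in both A and B, or in both C and D.

```python
from io import StringIO

import pandas as pd

data = """NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

df = pd.read_csv(StringIO(data), sep=r"\s+")

# Desired combinations (illustrative; extend with triplets as needed)
combos = [["A", "B"], ["C", "D"], ["A", "D"]]

# One boolean Series per combination: True where every column is non-NaN
masks = [df[combo].notnull().all(axis=1) for combo in combos]

# A row is kept if it satisfies at least one combination
keep = pd.concat(masks, axis=1).any(axis=1)
result = df[keep]
print(result)
```

For a single combination, df.dropna(subset=combo) selects the same rows as the corresponding mask.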

Pandas - Drop rows based on combinations of NaN values
