Question

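So here's the situation: I have a data frame with columns [A, B, C, D]. When both A and B are non-NaN (A is never NaN), nothing should be dropped. But when a value of A already has rows with a non-NaN B, and there is another row with the same A where B is NaN, that row needs to be dropped. The other case is when there is no A-B combination at all, only a single row where A is set and B is NaN; that row must not be dropped.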

For example:

  A    B     C   D
[Tom, Jane, cat, dog],
[Tom, Zack, monkey, sheep],
[Tom, Nan, fish, dolphine]

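So in this case the first and second rows should not be dropped, but the third row should be, because Tom already exists in column A and the value in column B of that row is NaN.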

Another case:

 A    B     C   D
[Jack, Nan, fish, dolphine]

In the whole data frame there is only one row where column A is Jack, so we don't drop this row regardless of whether B is NaN.

Answer 1

You can achieve the desired result with a single line:

df = df[df.apply(lambda row: not (pd.isna(row['B']) and len(df[df['A'] == row['A']]) > 1), axis=1)]

Details

The solution here is to combine df.apply() with a Python lambda function.

Setup

import pandas as pd
import numpy as np

data = {
    'A':['Tom', 'Tom', 'Tom','Jack'],
    'B':['Jane', 'Zack', np.nan,np.nan],
    'C':['Jane', 'Bear' , 'Cat','Bear'],
    'D':['Jane', 'Bear' , 'Cat','Bear'],
    }

# Create the data frame
df = pd.DataFrame(data)

# Set columns to check for duplicate and np.nan
dup_col = 'A'
nan_col = 'B'

# Print df before filter
print(df.head())

      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
2   Tom   NaN   Cat   Cat
3  Jack   NaN  Bear  Bear

df.apply applies a function along an axis; specifying axis=1 applies it to each row.

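The lambda function gives us access to the row variable.
The inner condition is whatever you define as a duplicate,
i.e. column 'B' is NaN and the value in 'A' is repeated.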
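The condition depends on detecting NaN in column 'B' reliably. Here is a minimal sketch of why testing by value with pd.isna tends to be the more robust choice, assuming the missing values are not guaranteed to be the np.nan object itself (the names other_nan, identity_check and value_check are made up for illustration):

```python
import numpy as np
import pandas as pd

# A NaN that is a different Python object from np.nan
other_nan = float("nan")

# Identity test: only True for the np.nan object itself
identity_check = other_nan is np.nan

# Value test: True for any missing value (NaN, None, NaT)
value_check = pd.isna(other_nan)

print(identity_check, value_check)
```

An identity test can therefore miss a NaN that was produced elsewhere, while pd.isna does not depend on which object holds the NaN.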

I've split it across multiple lines to make it easier to follow, but it can really be done in one line.


df = df[
    df.apply(lambda row:
    not(
        # pd.isna tests by value, so it also catches NaNs that are not the np.nan object
        pd.isna(row[nan_col]) and (len(df[df[dup_col] == row[dup_col]]) > 1)
    ), axis=1)
    ]

# Print after filter
print(df.head())

      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
3  Jack   NaN  Bear  Bear
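The same rule can also be written without a row-wise apply. This is only a sketch of an alternative, not part of the answer above: it assumes the rule is exactly "B is NaN and the value in A occurs more than once", and the names a_is_repeated, b_is_missing and filtered are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Tom", "Tom", "Tom", "Jack"],
    "B": ["Jane", "Zack", np.nan, np.nan],
    "C": ["Jane", "Bear", "Cat", "Bear"],
    "D": ["Jane", "Bear", "Cat", "Bear"],
})

# True for every row whose value in A occurs more than once
a_is_repeated = df["A"].duplicated(keep=False)
# True for every row whose value in B is missing
b_is_missing = df["B"].isna()

# Keep everything except rows that satisfy both conditions
filtered = df[~(b_is_missing & a_is_repeated)]
print(filtered)
```

Because both masks are computed on whole columns, this avoids re-filtering the frame once per row, which may matter on larger data.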

Answer 2

Here is the solution I found:

is_na = df['B'].isna() #This transformation (NaN -> True/False) is necessary to count

new_df = df[is_na].filter(['A'])
new_df['B'] = is_na #new_df has column A and column B with Trues and Falses
counting_nans = new_df.groupby('A')['B'].count()

counting_nans holds the number of NaNs in column B, grouped by the value in column A:

>>> df

    A   B   C   D
0   Tom     Jane    Cat     Bear
1   Tom     Jenny   Monkey  Tortue
2   Tom     NaN     Fish    Cow
3   Zac     NaN     Dog     Penguin

>>> counting_nans

A
Tom    1
Zac    1
Name: B, dtype: int64
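The per-group NaN count can also be obtained in a single step. This is a sketch under the assumption that summing the boolean NaN mask per group is acceptable; nan_counts is a hypothetical name, and unlike counting_nans it also lists values of A that have no NaN in B, with a count of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Tom", "Tom", "Tom", "Zac"],
    "B": ["Jane", "Jenny", np.nan, np.nan],
    "C": ["Cat", "Monkey", "Fish", "Dog"],
    "D": ["Bear", "Tortue", "Cow", "Penguin"],
})

# Sum the boolean NaN mask of B within each group of A
nan_counts = df["B"].isna().groupby(df["A"]).sum()
print(nan_counts)
```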

In uniques we store all the values that have to be evaluated, together with how often each one occurs in column A.

uniques = df['A'].value_counts()

>>> uniques

Tom    3
Zac    1
Name: A, dtype: int64

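Now let's filter. If a value appears in column 'A' as many times as it has NaNs in column 'B', none of its rows should be dropped; and if it appears only once in 'A', we can leave it out of the check as well (for that row it doesn't matter whether df['B'] is NaN).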

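# Align both Series on the same index; values of A with no NaN in B get a count of 0.
# Without this, the comparison below fails when the two indexes differ.
counting_nans = counting_nans.reindex(uniques.index, fill_value=0)

uniques = uniques[uniques != counting_nans]
uniques = uniques[uniques > 1]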

condition = df['A'].isin(uniques.index) & df['B'].isna() 
#This is an array with Trues when df['A'] is in values to be evaluated and df['B'] is NaN
index_condition = condition.loc[condition == True].index #These are the indexes

df.drop(index_condition, inplace=True) #This eliminates the rows

>>> df

     A      B       C        D
0  Tom   Jane     Cat     Bear
1  Tom  Jenny  Monkey   Tortue
3  Zac    NaN     Dog  Penguin
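For comparison, the same rule can be sketched more compactly with a group-wise transform. This is an assumption-laden alternative rather than the code above: it reads the rule as "drop a NaN-B row only if its value of A has at least one non-NaN B somewhere". The two Ann rows are made up for illustration, and has_b_value, to_drop and result are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Tom", "Tom", "Tom", "Zac", "Ann", "Ann"],
    "B": ["Jane", "Jenny", np.nan, np.nan, np.nan, np.nan],
})

# For each row: does its value of A have any non-NaN B in the frame?
has_b_value = df["B"].notna().groupby(df["A"]).transform("any")

# Drop only NaN-B rows whose A value also appears with a real B
to_drop = df["B"].isna() & has_b_value
result = df[~to_drop]
print(result)
```

Here the Tom row with NaN is dropped, while Zac and both Ann rows are kept because no A-B combination exists for them.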

Hope this helps! Let me know if my code is unclear. Also, I'm sure there is a simpler way; I'm pretty new to programming xD

pandas data frame: drop duplicates of column A when column B is NaN
