Remove duplicates in a dataframe based on conditions

Time: 2019-09-20 23:44:15

Tags: python pandas iteration

Here is my dataframe:

Fruits         Person        Eat

Banana         Peter         Yes 
Banana         Ashley        Yes
Strawberry     Peter         No
Strawberry     Ashley        Yes 
Cherry         Peter         Yes
Orange         Peter         No
Orange         Ashley        No
Grape          Ashley        Yes
Pear           Ashley        Yes
Pear           Peter         Yes

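My dataframe contains duplicate fruits, and I need to remove the duplicates according to the following logic. If a fruit is duplicated and both Peter and Ashley eat it, keep Peter's row and drop Ashley's row. If a fruit is duplicated and Peter does not eat it but Ashley does, drop Peter's row and keep Ashley's row. If a fruit is duplicated and neither Peter nor Ashley eats it, drop both rows.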
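To make these rules concrete, here is a minimal pure-Python sketch of the per-fruit decision. `pick_row` is a hypothetical helper name, not a pandas API. Two cases are assumptions, because the rules above do not cover them: Peter is kept when only he eats the fruit, and a row whose Eat value is No is always dropped.

```python
def pick_row(rows):
    """Return the (person, eat) row to keep for one fruit, or None.

    `rows` holds the (person, eat) pairs belonging to a single fruit.
    """
    # Everyone who actually eats this fruit.
    eaters = {person for person, eat in rows if eat == "Yes"}

    # Peter has priority whenever he eats the fruit
    # (assumption for the case where Ashley does not).
    if "Peter" in eaters:
        return ("Peter", "Yes")
    # Otherwise fall back to Ashley if she eats it.
    if "Ashley" in eaters:
        return ("Ashley", "Yes")
    # Nobody eats it: drop every row for this fruit.
    return None


banana = pick_row([("Peter", "Yes"), ("Ashley", "Yes")])
strawberry = pick_row([("Peter", "No"), ("Ashley", "Yes")])
orange = pick_row([("Peter", "No"), ("Ashley", "No")])
print(banana, strawberry, orange)
```

The three calls correspond to the Banana, Strawberry and Orange rows in the table above.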

With this logic, the dataframe should come out as:

Fruits         Person        Eat

Banana         Peter         Yes 
Strawberry     Ashley        Yes 
Cherry         Peter         Yes
Grape          Ashley        Yes
Pear           Peter         Yes

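I'm not sure how to iterate over a pandas dataframe to remove duplicates under these conditions. Normally, for the first condition, I would do something like this: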

data = [
    {
        "fruit": "Apple",
        "person": "Ashley",
        "eats": True
    },
    {
        "fruit": "Apple",
        "person": "Peter",
        "eats": True
    }
]
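eats = dict()      # person -> {fruit: row index} for each fruit that person eats
to_drop = set()    # row indices to delete once the loop has finished

for i, row in enumerate(data):
    fruit = row["fruit"]
    person = row["person"]
    does_eat = row["eats"]

    # if person does eat, record row number for later deletion if needed
    if does_eat:
        eats.setdefault(person, dict())[fruit] = i

    # dedup: when both eat the fruit, Peter's row wins and Ashley's is dropped
    # (test membership rather than .get(), because row index 0 is falsy)
    if fruit in eats.get("Peter", {}) and fruit in eats.get("Ashley", {}):
        to_drop.add(eats["Ashley"][fruit])

# delete after iterating; popping inside the loop would shift the indices
data = [row for i, row in enumerate(data) if i not in to_drop]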

Any help or tips on how to do this with a dataframe would be much appreciated.

1 Answer:

Answer 0: (score: 0)

Try this:

df1 = (df[df.Eat.eq('Yes')].sort_values('Person')
                           .drop_duplicates(subset='Fruits', keep='last'))

Out[14]:
       Fruits  Person  Eat
3  Strawberry  Ashley  Yes
7       Grape  Ashley  Yes
0      Banana   Peter  Yes
4      Cherry   Peter  Yes
9        Pear   Peter  Yes
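The one-liner above works because "Ashley" sorts before "Peter" alphabetically, so `keep='last'` favours Peter. As a self-contained sketch that does not depend on name order, the same idea can be written with an explicit priority map. The `priority` dict and the temporary `_prio` column are names introduced here for illustration, and the final `sort_index()` is only there to restore the original row order.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "Fruits": ["Banana", "Banana", "Strawberry", "Strawberry", "Cherry",
               "Orange", "Orange", "Grape", "Pear", "Pear"],
    "Person": ["Peter", "Ashley", "Peter", "Ashley", "Peter",
               "Peter", "Ashley", "Ashley", "Ashley", "Peter"],
    "Eat": ["Yes", "Yes", "No", "Yes", "Yes",
            "No", "No", "Yes", "Yes", "Yes"],
})

# Lower number = higher priority when both people eat the same fruit.
priority = {"Peter": 0, "Ashley": 1}

result = (
    df[df["Eat"].eq("Yes")]                               # rows with Eat == No never survive
    .assign(_prio=lambda d: d["Person"].map(priority))    # temporary ranking column
    .sort_values("_prio", kind="stable")                  # Peter's rows first
    .drop_duplicates(subset="Fruits", keep="first")       # best-ranked row per fruit
    .drop(columns="_prio")
    .sort_index()                                         # back to the original row order
    .reset_index(drop=True)
)
print(result)
```

Filtering on `Eat` first means the "neither eats it" case needs no special handling: those fruits simply have no rows left by the time duplicates are dropped.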