Select rows in a pandas DataFrame based on conditions across multiple columns

Time: 2018-11-09 12:11:33

Tags: python pandas

My dataset is as follows:

id     date  time  domain       activity
1  20thdec     2  amazon  add to basket
1  21stdec     2  amazon   product view
1  21stdec     3  amazon  add to basket
1  21stdec     4  amazon  add to basket
2  21stdec     4  amazon  add to basket
2  21stdec     6  amazon  add to basket 

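How can I remove the rows for an id whose activity column contains only one distinct value (i.e. for device_id = 2, the only activity is add to basket)? I still want to keep the rows for id 1, which has multiple add to basket values (duplicates) but also has other activities.

I tried pd.drop_duplicates, but it does not solve the problem.

Edit: none of the solutions below work. I need the following output: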

id     date  time  domain       activity
1  20thdec     2  amazon  add to basket
1  21stdec     2  amazon   product view
1  21stdec     3  amazon  add to basket
1  21stdec     4  amazon  add to basket

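The data for id = 2 should be removed because all of its activity, regardless of date/time, is nothing but add to basket. So all rows belonging to an id with a single activity should be dropped, and only rows belonging to an id with more than one activity should be kept. For example, the user with id = 1 has 2 activity levels ("product view" and "add to basket").

Apologies if this caused any confusion.

Thanks

4 Answers:

Answer 0 (score: 2)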

IIUC, use groupby with transform and nunique, compute cumsum for the values that are not equal (ne) to 1, and then use drop_duplicates with the subset parameter:

df.dropna(how='all',inplace=True)
cols = df.columns
df['Unique'] = df.groupby('id')['activity'].transform('nunique')
mask = df['Unique'].ne(1)
df.loc[mask,'Unique'] = df.loc[mask,'Unique'].cumsum()

df1 = df.drop_duplicates(subset = ['activity','Unique'])[cols]

print(df1)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
5   2  21stdec     4  amazon  add to basket

Explanation:

print(df.groupby('id')['activity'].transform('nunique'))
0    2
1    2
2    2
3    2
5    1
6    1
Name: activity, dtype: int64

print(df['Unique'].ne(1))
0     True
1     True
2     True
3     True
5    False
6    False
Name: Unique, dtype: bool

# After the line df.loc[mask,'Unique'] = df.loc[mask,'Unique'].cumsum()
print(df['Unique'])
0    2
1    4
2    6
3    8
5    1
6    1
Name: Unique, dtype: int64
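
The steps above can be reproduced on a rebuilt copy of the sample data. This is only a sketch: the DataFrame is reconstructed by hand from the question (with a default 0..5 index rather than the index shown in this answer), and the names work and result are illustrative. It suggests that this approach still keeps one row for id 2, which differs from the output requested in the question's edit.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumed default index 0..5).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": ["20thdec", "21stdec", "21stdec", "21stdec", "21stdec", "21stdec"],
    "time": [2, 2, 3, 4, 4, 6],
    "domain": ["amazon"] * 6,
    "activity": ["add to basket", "product view", "add to basket",
                 "add to basket", "add to basket", "add to basket"],
})

work = df.copy()
# Number of distinct activities per id, broadcast back to every row.
work["Unique"] = work.groupby("id")["activity"].transform("nunique")
# Rows whose id has more than one distinct activity.
mask = work["Unique"].ne(1)
# cumsum gives each of those rows a distinct key, so none of them is dropped.
work.loc[mask, "Unique"] = work.loc[mask, "Unique"].cumsum()

# Only rows sharing both activity and key (the single-activity ids) collapse.
result = work.drop_duplicates(subset=["activity", "Unique"])[df.columns]
print(result)
```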

Answer 1 (score: 1)

I think you need transform with nunique, then filter with ne(1) to keep only the groups that have more than one unique activity:

print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
4   2  21stdec     4  amazon  add to basket
5   2  21stdec     6  amazon  add to basket
6   3  21stdec     6  amazon  add to basket

df = df[df.groupby('id')['activity'].transform('nunique').ne(1)]
print (df)

   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
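
The same selection can also be written with GroupBy.filter, which some readers may find easier to follow. This is a minimal sketch, assuming the seven-row sample printed above; the variable name kept is illustrative and not part of the original answer.

```python
import pandas as pd

# Seven-row sample rebuilt by hand from the printout above.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 3],
    "date": ["20thdec", "21stdec", "21stdec", "21stdec",
             "21stdec", "21stdec", "21stdec"],
    "time": [2, 2, 3, 4, 4, 6, 6],
    "domain": ["amazon"] * 7,
    "activity": ["add to basket", "product view", "add to basket",
                 "add to basket", "add to basket", "add to basket",
                 "add to basket"],
})

# Keep every row of an id only if that id has more than one distinct activity.
kept = df.groupby("id").filter(lambda g: g["activity"].nunique() > 1)
print(kept)
```

The transform version is usually the faster of the two on large frames because it avoids calling a Python function once per group, but the filter version states the intent more directly.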

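Another solution removes only the groups in which every row is duplicated across the id and activity columns, so groups that contain a unique row are not removed (note that the output below still includes id 3, so it appears to be applied to the original df rather than the filtered one):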

idx = df.loc[~df.duplicated(['id','activity'], keep=False), 'id'].unique()
df = df[df['id'].isin(idx)]

Or:

df = df[~df.duplicated(['id','activity'], keep=False).groupby(df['id']).transform('all')]

print (df)
   id     date  time  domain       activity
0   1  20thdec     2  amazon  add to basket
1   1  21stdec     2  amazon   product view
2   1  21stdec     3  amazon  add to basket
3   1  21stdec     4  amazon  add to basket
6   3  21stdec     6  amazon  add to basket
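
The two approaches in this answer treat an id with a single row differently, which the following sketch tries to make explicit. It assumes the seven-row sample from this answer; the names by_nunique, by_duplicated, dup and all_dup are illustrative.

```python
import pandas as pd

# Seven-row sample rebuilt by hand from this answer.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 3],
    "date": ["20thdec", "21stdec", "21stdec", "21stdec",
             "21stdec", "21stdec", "21stdec"],
    "time": [2, 2, 3, 4, 4, 6, 6],
    "domain": ["amazon"] * 7,
    "activity": ["add to basket", "product view", "add to basket",
                 "add to basket", "add to basket", "add to basket",
                 "add to basket"],
})

# Approach 1: keep ids that have more than one distinct activity.
by_nunique = df[df.groupby("id")["activity"].transform("nunique").ne(1)]

# Approach 2: drop ids in which every row repeats an (id, activity) pair.
dup = df.duplicated(["id", "activity"], keep=False)
all_dup = dup.groupby(df["id"]).transform("all")
by_duplicated = df[~all_dup]

print(by_nunique["id"].unique().tolist())
print(by_duplicated["id"].unique().tolist())
```

On this data the first approach should keep only id 1, while the second should also keep id 3, because its single row is not a duplicate of anything. Which behaviour is wanted depends on whether an id with exactly one row counts as having "a single activity".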

Answer 2 (score: 0)

You can specify the subset parameter in drop_duplicates:

dataset.drop_duplicates(subset=['id', 'activity'])
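
To see what this call does on the question's data, here is a small sketch. The DataFrame is rebuilt by hand from the question and the name deduped is illustrative. It suggests that the call keeps the first row of each (id, activity) pair, so repeated rows for id 1 are dropped and one row for id 2 survives, which does not appear to match the output requested in the question's edit.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumed default index 0..5).
dataset = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": ["20thdec", "21stdec", "21stdec", "21stdec", "21stdec", "21stdec"],
    "time": [2, 2, 3, 4, 4, 6],
    "domain": ["amazon"] * 6,
    "activity": ["add to basket", "product view", "add to basket",
                 "add to basket", "add to basket", "add to basket"],
})

# Keeps only the first occurrence of each (id, activity) combination.
deduped = dataset.drop_duplicates(subset=["id", "activity"])
print(deduped)
```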

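Answer 3 (score: 0)

As I understand it, you only want to drop duplicates where id == 2. You can still use drop_duplicates, but you must specify subset='activity' and apply it only to the rows of the dataframe that have id == 2. Then you concat the result with the rows that have id == 1, which gives: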

df = pd.concat([df[df['id'] == 1], df[df['id'] == 2].drop_duplicates(subset='activity')])
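
The line above hard-codes the two ids. A possible generalisation, offered only as a sketch, derives the single-activity ids from the data instead; the names n_activities, single and result are illustrative and not part of the original answer. Like the line above, it keeps one row for each single-activity id rather than removing such ids entirely.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumed default index 0..5).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": ["20thdec", "21stdec", "21stdec", "21stdec", "21stdec", "21stdec"],
    "time": [2, 2, 3, 4, 4, 6],
    "domain": ["amazon"] * 6,
    "activity": ["add to basket", "product view", "add to basket",
                 "add to basket", "add to basket", "add to basket"],
})

# Flag rows whose id has exactly one distinct activity.
n_activities = df.groupby("id")["activity"].transform("nunique")
single = n_activities.eq(1)

# Leave multi-activity ids untouched; de-duplicate only the single-activity ids.
result = pd.concat([
    df[~single],
    df[single].drop_duplicates(subset=["id", "activity"]),
])
print(result)
```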