I have a df:
    Sr.No  Name  Class  Data
0   1      Sri   1      sri is a good player
1   ''     Sri   2      sri is good in cricket
2   ''     Sri   3      sri went out
3   2      Ram   1      Ram is a good player
4   ''     Ram   2      sri is good in cricket
5   ''     Ram   3      Ram went out
6   3      Sri   1      sri is a good player
7   ''     Sri   2      sri is good in cricket
8   ''     Sri   3      sri went out
9   4      Sri   1      sri is a good player
10  ''     Sri   2      sri is good in cricket
11  ''     Sri   3      sri went out
12  ''     Sri   4      sri came back
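I am trying to drop duplicates based on ["Name", "Class", "Data"]. The goal is to drop a record only when all of the sentences under its Sr.No duplicate those of an earlier record.
My expected output is: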
out_df
    Sr.No  Name  Class  Data
0   1      Sri   1      sri is a good player
1          Sri   2      sri is good in cricket
2          Sri   3      sri went out
3   2      Ram   1      Ram is a good player
4          Ram   2      sri is good in cricket
5          Ram   3      Ram went out
9   4      Sri   1      sri is a good player
10         Sri   2      sri is good in cricket
11         Sri   3      sri went out
12         Sri   4      sri came back
Answer 0 (score: 1)
Use a groupby + transform operation to create a dummy column.
v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)
Or,
v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)
This dummy column then becomes a factor when deciding which rows to drop.
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
df[~m]
    Class  Data                    Name  Sr.No
0   1      sri is a good player    Sri   1
1   2      sri is good in cricket  Sri
2   3      sri went out            Sri
3   1      Ram is a good player    Ram   2
4   2      sri is good in cricket  Ram
5   3      Ram went out            Ram
9   1      sri is a good player    Sri   4
10  2      sri is good in cricket  Sri
11  3      sri went out            Sri
12  4      sri came back           Sri
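To try the steps above end to end, here is a minimal sketch that rebuilds the sample frame by hand. It assumes the blank Sr.No cells are empty strings; the variable names `block`, `joined`, and `mask` are chosen here for illustration and do not come from the answer.

```python
import pandas as pd

# Sample frame rebuilt by hand; blank Sr.No cells are assumed to be empty strings.
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": [
        "sri is a good player", "sri is good in cricket", "sri went out",
        "Ram is a good player", "sri is good in cricket", "Ram went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri came back",
    ],
})

# A new block starts wherever Class stops increasing.
block = df["Class"].diff().le(0).cumsum()

# Every row carries the joined text of its whole block.
joined = df.groupby(block)["Data"].transform(" ".join)

# A row counts as a duplicate only if its own fields and its block's full text repeat.
mask = df.assign(Foo=joined).duplicated(["Name", "Class", "Data", "Foo"])
out_df = df[~mask]
print(out_df)
```

Because the joined text is part of the duplicate check, the record with Sr.No 4 survives even though its first three rows match earlier ones: its block also contains "sri came back".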
Details
Form groups from the monotonically increasing Class values:
i = df['Class'].diff().le(0).cumsum()
i
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 3
12 3
Name: Class, dtype: int64
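The chained expression can be easier to follow when each step is looked at separately. The sketch below does that on the Class values alone; the intermediate names `step`, `resets`, and `group_id` are made up for illustration.

```python
import pandas as pd

cls = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4], name="Class")

# Change from the previous row; the first row has no predecessor, so it is NaN.
step = cls.diff()

# True where the value did not increase; NaN compares as False.
resets = step.le(0)

# A running count of the resets gives one integer label per block.
group_id = resets.cumsum()

print(pd.DataFrame({"Class": cls, "diff": step, "reset": resets, "group": group_id}))
```

This only works as a record key if Class strictly increases within a record and drops or repeats at the start of the next one, which holds for the sample data.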
Group by this key and transform Data with the str.join operation:
v = df.groupby(i)['Data'].transform(' '.join)
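The result is just a Series of joined strings: each row holds all of the Data values of its group, joined with spaces. Finally, assign the dummy column and call duplicated: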
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
m
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
12 False
dtype: bool
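The grouping key above relies on Class restarting at each new record. If that cannot be guaranteed, a key could instead be derived from Sr.No itself. The following is a sketch of that variation, not part of the answer above. It assumes blank Sr.No cells are empty strings, and both the reduced frame and the name `key` are hypothetical.

```python
import pandas as pd

# Hypothetical reduced frame: records 1 and 3 are identical, record 2 differs.
df = pd.DataFrame({
    "Sr.No": [1, "", 2, "", 3, ""],
    "Name": ["Sri", "Sri", "Ram", "Ram", "Sri", "Sri"],
    "Class": [1, 2, 1, 2, 1, 2],
    "Data": ["a", "b", "c", "b", "a", "b"],
})

# Assumption: blanks are empty strings, so coercing to numeric turns them into NaN,
# and a forward fill then spreads each Sr.No over the rows of its record.
key = pd.to_numeric(df["Sr.No"], errors="coerce").ffill()

# Same idea as before: attach each record's full text, then check for duplicates.
joined = df.groupby(key)["Data"].transform(" ".join)
mask = df.assign(Foo=joined).duplicated(["Name", "Class", "Data", "Foo"])
print(df[~mask])
```

Here the third record is dropped because its Name, Class, Data, and joined text all repeat the first record.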