Question

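Goal: keep a single row for near-identical "duplicates"

Background: the df below shows that ID 1 and ID 2 are "duplicates" of each other ("hey there https://abc" and "hey there https://efg"). The text is nearly identical except for the abc and efg at the end of each string.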

import pandas as pd    
df = pd.DataFrame(dict(ID=[1,2,3,4], Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))

Output:

    ID  Text
0   1   hey there https://abc
1   2   hey there https://efg
2   3   hello
3   4   hi

I can remove exact duplicates with the following code:

df.drop_duplicates(subset=['Text'], keep="first")

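But because the text of ID 1 and ID 2 is not exactly identical, the code above does not remove either row.

Question: how can I get the following output?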

    ID  Text
0   1   hey there https://abc
1   3   hello
2   4   hi

Answer 1

#convert column into list of strings
dfList = df['Text'].tolist()

output:
['hey there https://abc', 'hey there https://efg', 'hello', 'hi']


#split strings on https
split = [i.split('https:', 1)[0] for i in dfList]

output:
['hey there ', 'hey there ', 'hello', 'hi']



#put list back in df
df['Split_Text'] = split

output:

    ID  Text                    Split_Text
0   1   hey there https://abc   hey there
1   2   hey there https://efg   hey there
2   3   hello                   hello
3   4   hi                      hi


#keep only the first row of each group of duplicates
Nodup_df = df.drop_duplicates(subset=['Split_Text'], keep="first")


output:
    ID  Text                    Split_Text
0   1   hey there https://abc   hey there
2   3   hello                   hello
3   4   hi                      hi


#eliminate Split_Text column
Nodup_df.iloc[:, 0:2]

output:

    ID  Text
0   1   hey there https://abc
2   3   hello
3   4   hi
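
The same steps can also be written more compactly with pandas string methods. This is only a sketch under the same assumption as the answer above, namely that two rows count as duplicates when the text before the first "https:" is identical. The names `key` and `result` are introduced here for illustration; `reset_index(drop=True)` is added so the index runs 0, 1, 2 as in the output the question asks for.

```python
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 2, 3, 4],
    Text=["hey there https://abc", "hey there https://efg", "hello", "hi"],
))

# Comparison key: everything before the first "https:".
# Kept as a separate Series so no helper column is added to df.
key = df["Text"].str.split("https:", n=1).str[0]

# Keep the first row for each key, then renumber the index from 0.
result = df[~key.duplicated(keep="first")].reset_index(drop=True)
print(result)
```

Because the key lives outside the DataFrame, there is no Split_Text column to drop afterwards. Rows whose text differs in anything other than the URL part are still treated as distinct.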
