Goal: keep only one row for very similar "duplicates"
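Background:
In the df below, ID 1 and ID 2 are "duplicates" of each other ("hey there https://abc" and "hey there https://efg"). Apart from the abc and efg at the end of each string, the text is nearly identical.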
import pandas as pd
df = pd.DataFrame(dict(ID=[1,2,3,4], Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))
Output:
ID Text
0 1 hey there https://abc
1 2 hey there https://efg
2 3 hello
3 4 hi
I can remove exact duplicates with the following code:
df.drop_duplicates(subset=['Text'], keep="first")
But because the Text values for ID 1 and ID 2 are not exactly identical, the code above does not remove either row.
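To confirm the behavior described here, a minimal check (assuming the same df as above; the variable name `exact` is only illustrative) shows that exact matching leaves all four rows in place:

```python
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 2, 3, 4],
    Text=["hey there https://abc", "hey there https://efg", "hello", "hi"],
))

# drop_duplicates compares whole strings, so the two "hey there ..." rows
# count as different values and both survive.
exact = df.drop_duplicates(subset=["Text"], keep="first")
print(len(exact))  # 4
```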
Question: how can I get the following output?
ID Text
0 1 hey there https://abc
1 3 hello
2 4 hi
Answer 0 (score: 0)
#convert column into list of strings
dfList = df['Text'].tolist()
output:
['hey there https://abc', 'hey there https://efg', 'hello', 'hi']
#split strings on https
split = [i.split('https:', 1)[0] for i in dfList]
output:
['hey there ', 'hey there ', 'hello', 'hi']
#put list back in df
df['Split_Text'] = split
output:
ID Text Split_Text
0 1 hey there https://abc hey there
1 2 hey there https://efg hey there
2 3 hello hello
3 4 hi hi
#keep one duplicate row
Nodup_df = df.drop_duplicates(subset=['Split_Text'], keep="first")
output:
ID Text Split_Text
0 1 hey there https://abc hey there
2 3 hello hello
3 4 hi hi
#eliminate Split_Text column
Nodup_df.iloc[:, 0:2]
output:
ID Text
0 1 hey there https://abc
2 3 hello
3 4 hi
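The result above keeps the original index labels (0, 2, 3), while the output asked for in the question is numbered 0, 1, 2. The same steps can be condensed into a short sketch that also renumbers the index. It assumes, as the answer does, that the only difference between near-duplicates is whatever follows "https:"; the names `key` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 2, 3, 4],
    Text=["hey there https://abc", "hey there https://efg", "hello", "hi"],
))

# Build a comparison key: keep the part before the first "https:" and
# trim surrounding whitespace. Strings without "https:" are left whole.
key = df["Text"].str.split("https:", n=1).str[0].str.strip()

# Keep the first row for each distinct key, then renumber the index
# so it runs 0, 1, 2 as in the desired output.
result = df[~key.duplicated(keep="first")].reset_index(drop=True)
print(result)
```

Using a boolean mask from `key.duplicated()` avoids adding and then removing a helper column. If near-duplicates can differ in other ways (different wording, casing, or URL schemes), the key would need a different normalization step.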