Keep one row for very similar "duplicates"

Time: 2018-03-02 23:39:18

Tags: python pandas duplicates

Goal: keep one row for very similar "duplicates"

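Background: The df below shows that ID 1 and ID 2 are "duplicates" of each other ("hey there https://abc" and "hey there https://efg"). The text is nearly identical except for the abc and efg at the end of each string.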

import pandas as pd    
df = pd.DataFrame(dict(ID=[1,2,3,4], Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))

Output:

    ID  Text
0   1   hey there https://abc
1   2   hey there https://efg
2   3   hello
3   4   hi

I can remove duplicates with the following code:

df.drop_duplicates(subset=['Text'], keep="first")

But because ID 1 and ID 2 are not exactly identical, the code above has no effect.
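
A quick check makes the reason visible. This is a minimal sketch using the same df; the variable names exact_mask and unchanged are only illustrative. duplicated() compares the full strings for exact equality, so no row is flagged:

```python
import pandas as pd

df = pd.DataFrame(dict(ID=[1, 2, 3, 4],
                       Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))

# duplicated() marks a row True only when an earlier row holds an identical value
exact_mask = df.duplicated(subset=["Text"], keep="first")
print(exact_mask.tolist())

# Nothing is flagged, so drop_duplicates() returns all four rows
unchanged = df.drop_duplicates(subset=["Text"], keep="first")
print(len(unchanged))
```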

Question: How can I get the following output?

    ID Text
0   1   hey there https://abc
1   3   hello
2   4   hi

1 Answer:

Answer 0 (score: 0)

#convert column into list of strings
dfList = df['Text'].tolist()

output:
['hey there https://abc', 'hey there https://efg', 'hello', 'hi']


#split strings on https
split = [i.split('https:', 1)[0] for i in dfList]

output:
['hey there ', 'hey there ', 'hello', 'hi']



#put list back in df
df['Split_Text'] = split

output:

    ID  Text                    Split_Text
0   1   hey there https://abc   hey there
1   2   hey there https://efg   hey there
2   3   hello                   hello
3   4   hi                      hi
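
The list round trip above can probably be skipped. As a sketch (not part of the original answer), the pandas string accessor should perform the same split directly on the column, assuming every value in Text is a string:

```python
import pandas as pd

df = pd.DataFrame(dict(ID=[1, 2, 3, 4],
                       Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))

# Split each string once on 'https:' and keep the part before it
df["Split_Text"] = df["Text"].str.split("https:", n=1).str[0]
print(df)
```

Note that the key for the first two rows keeps its trailing space ('hey there '), exactly as in the list version.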


#keep one duplicate row
Nodup_df = df.drop_duplicates(subset=['Split_Text'], keep="first")


output:
    ID  Text                    Split_Text
0   1   hey there https://abc   hey there
2   3   hello                   hello
3   4   hi                      hi


#eliminate Split_Text column
Nodup_df.iloc[:, 0:2]

output:

    ID  Text
0   1   hey there https://abc
2   3   hello
3   4   hi
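
The steps above can also be condensed so that no helper column has to be added and removed. The following is a minimal sketch rather than the answerer's method: it assumes that two rows count as "duplicates" when their text is identical once any http/https URL is removed, and the function name drop_near_duplicates and the regular expression are illustrative choices.

```python
import pandas as pd


def drop_near_duplicates(frame, column="Text"):
    """Keep the first row among rows whose text matches after URLs are removed.

    Hypothetical helper: the URL pattern is an assumption about what makes
    two rows "near-duplicates" in this particular data set.
    """
    # Build a comparison key: remove URLs, then strip surrounding whitespace
    key = (
        frame[column]
        .str.replace(r"https?://\S+", "", regex=True)
        .str.strip()
    )
    # key.duplicated() is True for every repeat after the first occurrence
    return frame[~key.duplicated(keep="first")].reset_index(drop=True)


df = pd.DataFrame(dict(ID=[1, 2, 3, 4],
                       Text=["hey there https://abc", "hey there https://efg", "hello", "hi"]))
result = drop_near_duplicates(df)
print(result)
```

The reset_index(drop=True) call renumbers the remaining rows 0, 1, 2, which matches the index shown in the question's desired output; without it the index stays 0, 2, 3 as in the answer above.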