pandas DataFrame - how to remove repeated words in a column

Time: 2020-09-27 19:14:16

Tags: python pandas dataframe

I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

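I want to get the following result (each row without the words that are repeated in every row):

Expected result (for the example above):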

category     text
0            is some for the first
1            second has
2            third is the

With the following code, I tried to collect the data from all rows into a single string:

final_list = []
for index, rows in df.iterrows():
    # Take the text of the current row
    my_list = rows.text
    # Append it to the final list
    final_list.append(my_list)
# Print the list
print(final_list)

# Join all rows into one comma-separated string
text = ''
for i in range(len(final_list)):
    text += final_list[i] + ', '

print(text)

The idea from this question (pandas dataframe- how to find words that repeat in each row) does not help me get the expected result.

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)

Does anyone know how to get this?

1 answer:

Answer 0: (score: 3)

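You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace to remove those common words: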

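import re
from functools import reduce

# Words that appear in every row
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# Remove them as whole words only; regex=True is needed on pandas >= 2.0
pattern = rf"(\s*)\b(?:{'|'.join(map(re.escape, w))})\b\s*"
df['text'] = df['text'].str.replace(pattern, r'\1', regex=True).str.strip()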
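
If you would rather avoid building a regular expression, the same idea can be sketched with plain word lists. This is only an illustrative alternative, not part of the answer above; the names tokens and common are made up for the example, and it assumes words are separated by whitespace and that the frame has at least one row.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Split each row into a list of words
tokens = df['text'].str.split()

# Words present in every row (intersection of the per-row word sets)
common = set.intersection(*map(set, tokens))

# Rebuild each row, keeping only the words that are not common to all rows
df['text'] = tokens.apply(lambda words: ' '.join(word for word in words if word not in common))

print(df)
```

Because the comparison is done word by word, a common word such as "row" is not removed from inside a longer word such as "rows", and no escaping of special characters is needed.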