I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'category':[0,1,2],
'text': ['this is some text for the first row',
'second row has this text',
'third row this is the text']})
df.head()
I want to get the following result (without the words that repeat in every row):
Expected result (for the example above):
category text
0 is some for the first
1 second has
2 third is the
With the following code, I tried to collect the text from all rows into a single string:
final_list = []
for index, rows in df.iterrows():
    # Take the text of the current row
    my_list = rows.text
    # Append it to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text = ''
for i in range(len(final_list)):
    text += final_list[i] + ', '
print(text)
The idea from this question (pandas dataframe- how to find words that repeat in each row) does not help me get the expected result.
arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)
Does anyone know how to get this?
Answer 0 (score: 3)
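You can use Series.str.split to split the column text on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace to remove those common words: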
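from functools import reduce
# Words that occur in every row (intersection of the per-row word lists)
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# Remove them; regex=True must be passed explicitly since pandas 2.0
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()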
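Note that the regex above has no word boundaries, so a common word could also be removed from inside a longer word. As a possible alternative, here is a minimal sketch that avoids regex entirely by filtering tokens. It assumes words are separated by whitespace and that matching is case-sensitive; the names tokens and common are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'category': [0, 1, 2],
    'text': ['this is some text for the first row',
             'second row has this text',
             'third row this is the text'],
})

# Split each row into a list of whitespace-separated words.
tokens = df['text'].str.split()

# Words present in every row: intersect the per-row word sets.
common = set.intersection(*map(set, tokens))

# Keep only the words that are not common to all rows, preserving their order.
df['text'] = tokens.apply(lambda words: ' '.join(w for w in words if w not in common))

print(df)
```

Because whole tokens are compared, a word such as "rows" would be kept even though "row" is common to all rows.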