I have a pandas df where each row is a list of words. The lists contain duplicate words, and I want to remove the duplicates.
I tried looping over each row of the df with dict.fromkeys(listname) in a for loop, but this splits the words into individual letters.
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
l = df["text_lemmatized"][i]
df["newlist"][i] = list(dict.fromkeys(l))
print(df)
Expected result ==>
['clear', 'pending', 'order', 'pending', 'order'] ['clear', 'pending', 'order']
['pending', 'activation', 'clear', 'pending'] ['pending', 'activation', 'clear']
Actual result:
['clear', 'pending', 'order', 'pending', 'order'] ... [[, ', c, l, e, a, r, ,, , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ... ... [[, ', p, e, n, d, i, g, ,, , a, c, t, v, o, ...
Answer 0 (score: 2)
Use set to remove the duplicates. You also don't need an explicit for loop; apply the set per row:

df["newlist"] = df["text_lemmatized"].apply(lambda x: list(set(x)))
Answer 1 (score: 1)
Just use np.unique with series.map. Your sample data:

Out[43]:
                            text_lemmatized
0  [clear, pending, order, pending, order]
1     [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
0         [clear, order, pending]
1    [activation, clear, pending]
Name: text_lemmatized, dtype: object

If you do not want the result sorted, use pd.unique instead.
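A minimal sketch of that order-preserving variant (assuming, as above, that the column holds real lists): pd.unique keeps the first-seen order, while np.unique returns a sorted array.

import pandas as pd

df = pd.DataFrame({
    "text_lemmatized": [
        ["clear", "pending", "order", "pending", "order"],
        ["pending", "activation", "clear", "pending"],
    ]
})

# pd.unique preserves insertion order; wrap the returned ndarray in list()
# if plain Python lists are preferred.
df["newlist"] = df["text_lemmatized"].map(lambda x: list(pd.unique(x)))
print(df)
# row 0 -> ['clear', 'pending', 'order']
# row 1 -> ['pending', 'activation', 'clear']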
Answer 2 (score: 0)
df.drop_duplicates(subset="text_lemmatized",
                   keep="first", inplace=True)
keep="first" means keeping the first occurrence.
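Note that drop_duplicates removes whole duplicate rows of the frame (and needs hashable values in the subset column, e.g. strings); it does not remove duplicate words inside each list. A minimal sketch with made-up string values:

import pandas as pd

# Hypothetical frame where an entire row repeats.
df = pd.DataFrame({"text_lemmatized": ["clear pending order",
                                       "clear pending order",
                                       "pending activation"]})

# Keep only the first occurrence of each duplicated row.
df.drop_duplicates(subset="text_lemmatized", keep="first", inplace=True)
print(df)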
Answer 3 (score: 0)
The problem is that the values are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; each value can then be converted to a set to remove the duplicates:
import ast
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
text_lemmatized newlist
0 [clear, pending, order, pending, order] [clear, pending, order]
1 [pending, activation, clear, pending] [clear, activation, pending]
Or use dict.fromkeys:
f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)
Another approach is to convert the text_lemmatized column to lists in one step and then remove the duplicates in a second step; the advantage is that the lists in the text_lemmatized column are available for further processing:
df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
EDIT:
After discussion, the solution is:
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
Answer 4 (score: 0)
The code you are using to remove duplicates looks fine. I tried the following and it works well. My guess is that the problem is in how the lists were added to the dataframe column.
list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]
list_with_unique_words = []
for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)
print(list_with_unique_words)
Output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]
df["newlist"] = list_with_unique_words
df
Answer 5 (score: 0)
Solution ==>
import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)
Thanks to jezrael and everyone else who helped narrow down this solution.