I have a pandas df where each row is a list of words. The lists contain duplicate words, and I want to remove the duplicates.
I tried looping over each row of the df with dict.fromkeys(listname) in a for loop, but this splits the words into individual letters.
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
l = df["text_lemmatized"][i]
df["newlist"][i] = list(dict.fromkeys(l))
print(df)
Expected result ==>
['clear', 'pending', 'order', 'pending', 'order'] ['clear', 'pending', 'order']
['pending', 'activation', 'clear', 'pending'] ['pending', 'activation', 'clear']
Actual result:
['clear', 'pending', 'order', 'pending', 'order'] ... [[, ', c, l, e, a, r, ,, , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ... ... [[, ', p, e, n, d, i, g, ,, , a, c, t, v, o, ...
Answer 0 (score: 2)
Use set to remove the duplicates. You also don't need an explicit for loop; apply the set per row:

df["newlist"] = df["text_lemmatized"].apply(lambda x: list(set(x)))
Answer 1 (score: 1)
Just use np.unique with series.map. Your sample data:

Out[43]:
                            text_lemmatized
0  [clear, pending, order, pending, order]
1     [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
0         [clear, order, pending]
1    [activation, clear, pending]
Name: text_lemmatized, dtype: object

If you do not want the result sorted, use pd.unique instead.
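A minimal sketch of that order-preserving variant (assuming, as above, that the column holds real lists): pd.unique keeps the first-seen order, while np.unique returns a sorted array.

import pandas as pd

df = pd.DataFrame({
    "text_lemmatized": [
        ["clear", "pending", "order", "pending", "order"],
        ["pending", "activation", "clear", "pending"],
    ]
})

# pd.unique preserves insertion order; wrap the returned ndarray in list()
# if plain Python lists are preferred.
df["newlist"] = df["text_lemmatized"].map(lambda x: list(pd.unique(x)))
print(df)
# row 0 -> ['clear', 'pending', 'order']
# row 1 -> ['pending', 'activation', 'clear']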
Answer 2 (score: 0)
df.drop_duplicates(subset="text_lemmatized",
                   keep="first", inplace=True)
keep="first" means keeping the first occurrence.
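Note that drop_duplicates removes whole duplicate rows of the frame (and needs hashable values in the subset column, e.g. strings); it does not remove duplicate words inside each list. A minimal sketch with made-up string values:

import pandas as pd

# Hypothetical frame where an entire row repeats.
df = pd.DataFrame({"text_lemmatized": ["clear pending order",
                                       "clear pending order",
                                       "pending activation"]})

# Keep only the first occurrence of each duplicated row.
df.drop_duplicates(subset="text_lemmatized", keep="first", inplace=True)
print(df)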
Answer 3 (score: 0)
The problem is that the values are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; each value can then be converted to a set to remove the duplicates:
import ast
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
text_lemmatized newlist
0 [clear, pending, order, pending, order] [clear, pending, order]
1 [pending, activation, clear, pending] [clear, activation, pending]
Or use dict.fromkeys:
f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)
Another approach is to convert the text_lemmatized column to lists in one step and then remove the duplicates in a second step; the advantage is that the lists in the text_lemmatized column are available for further processing:
df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
EDIT:
After discussion, the solution is:
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
Answer 4 (score: 0)
The code you are using to remove duplicates looks fine. I tried the following and it works well. My guess is that the problem is in how the lists were added to the dataframe column.
list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]
list_with_unique_words = []
for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)
print(list_with_unique_words)
Output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]
df["newlist"] = list_with_unique_words
df
Answer 5 (score: 0)
Solution ==>
import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)
Thanks to jezrael and everyone else who helped narrow down this solution.