Remove duplicate words from a pandas column

Time: 2019-07-02 13:46:10

Tags: python pandas

I have a DataFrame that contains information like the following:

>>> Results.Category[:5]
0    issue delivery wrong master account
1      data wrong master account batch
2    order delivery wrong data account
3    issue delivery wrong master account
4    delivery wrong master account batch
Name: Category, dtype: object

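Now I want to keep only unique words across the "Category" column. For example, the word "wrong" appears in the first row, so I want to remove it from all the remaining rows and keep it only in the first row. The word "data" appears in the second row, so I want to remove it from all the remaining rows and keep it only in the second row.

I found that duplicates within a single row can be removed with the code below, but I need to remove duplicate words across the whole column. Can anyone help me here?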

AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))

4 answers:

Answer 0: (score: 3)

It seems you want something like this:

out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)

df['FinalCategoryN'] = out
df

                              Category                       FinalCategoryN
0  issue delivery wrong master account  issue delivery wrong master account
1      data wrong master account batch                           data batch
2    order delivery wrong data account                                order
3  issue delivery wrong master account                                     
4  delivery wrong master account batch                                     
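
The loop above can also be wrapped in a reusable function. This is only a minimal sketch: the name drop_seen_words is hypothetical, and it assumes every row is a plain string (no NaN). Unlike the loop above, it updates the set word by word, so a word repeated inside a single row is also kept only once.

```python
def drop_seen_words(rows):
    """Keep each word only in the first row where it appears, preserving word order."""
    seen = set()
    out = []
    for row in rows:
        kept = []
        for word in row.split():
            if word not in seen:
                seen.add(word)      # remember the word for all later rows
                kept.append(word)
        out.append(' '.join(kept))
    return out


# Sample rows copied from the question
rows = [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]
result = drop_seen_words(rows)
print(result)
```

Because the function accepts any iterable of strings, it could be called on a column directly, for example df['FinalCategoryN'] = drop_seen_words(df['Category']).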

If you don't care about word order, you can use set logic:

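# Split each row into a list of words
u = df['Category'].apply(str.split)
# Words seen in all previous rows: shift down, replace the leading NaN with [], accumulate
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
# Keep only the words that did not appear in an earlier row
(u.map(set) - v).str.join(' ')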

0    account delivery issue master wrong
1                             batch data
2                                  order
3                                       
4                                       
Name: Category, dtype: object

Answer 1: (score: 2)

In this case you first need to split, then remove the duplicates with drop_duplicates:

df.c.str.split(expand=True).stack().drop_duplicates().\
     groupby(level=0).apply(','.join).reindex(df.index)
Out[206]: 
0    issue,delivery,wrong,master,account
1                             data,batch
2                                  order
3                                    NaN
4                                    NaN
dtype: object
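
The same split-then-drop_duplicates idea can be sketched with Series.explode, which is available in pandas 0.25 and later. This is an illustrative variant, not the answer's exact code: it assumes the column is named Category as in the question, joins with spaces instead of commas, and fills rows that lose all their words with an empty string instead of NaN.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'Category': [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]})

# One word per row; the original row label is repeated for each word
words = df['Category'].str.split().explode()

# Keep only the first occurrence of every word across the whole column
first_seen = words.drop_duplicates()

# Re-assemble the surviving words per original row; rows with no words left become ''
result = (first_seen.groupby(level=0).agg(' '.join)
          .reindex(df.index, fill_value=''))
print(result.tolist())
```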

Answer 2: (score: 1)

This isn't something you can vectorize, so let's forget pandas and use a Python set:

total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))

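You will get the following list (sets are unordered, so the word order within each string may vary): ['wrong issue master delivery account', 'batch data', 'order', '', '']

You can use it to populate the DataFrame column: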

AFResults['FinalCategoryN'] = result

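Answer 3: (score: 0)

Use apply with sorted and set, together with str.join and str.index (x is a string here, so x.index is str.index). Note that this removes duplicates only within each row, not across rows: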

AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))
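
One caveat with key=x.index: str.index searches for substrings, so a short word that also occurs inside an earlier word can end up in the wrong position. A possible alternative for per-row de-duplication, sketched here under the assumption of Python 3.7+ (where dict preserves insertion order), uses dict.fromkeys. The function name dedupe_within_row is hypothetical.

```python
def dedupe_within_row(text):
    """Remove repeated words inside one string, keeping first-occurrence order."""
    # dict.fromkeys keeps one key per distinct word, in insertion order (Python 3.7+)
    return ' '.join(dict.fromkeys(text.split()))


example = dedupe_within_row('data xyz a data')
print(example)
```

It could then be applied to a column with AFResults['FinalCategory'].apply(dedupe_within_row). Like the answer above, this only handles duplicates within a single row.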