This is my dataset:
Id Text
1. Dear Mr. John, your bag order is delivered
2. Dear Mr. Brick, your ball order is delivered
3. Dear Mrs. Blue, your ball purchase is delivered
What I need is:
Id Text
1. Dear Mr. your order is delivered
2. Dear Mr. your ball order is delivered
3. Dear your ball is delivered
So every word that occurs only once in the whole column is removed.
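For reproducibility, here is a minimal sketch of the data above as a DataFrame. The float `Id` values are an assumption based on the `1.0`, `2.0`, `3.0` shown in the answers' output, and `once_tokens` is a name introduced only for illustration. It also lists the whitespace-separated tokens that occur exactly once, which are the ones that should disappear:

```python
import pandas as pd

# Sample data copied from the question; Id as float is an assumption.
df = pd.DataFrame({
    "Id": [1.0, 2.0, 3.0],
    "Text": [
        "Dear Mr. John, your bag order is delivered",
        "Dear Mr. Brick, your ball order is delivered",
        "Dear Mrs. Blue, your ball purchase is delivered",
    ],
})

# Count every whitespace-separated token across the whole column.
counts = df["Text"].str.split().explode().value_counts()

# Tokens seen exactly once (punctuation stays attached, e.g. "John,").
once_tokens = sorted(counts[counts == 1].index)
print(once_tokens)
```

Note that tokens are split on whitespace only, so `John,` and `Mrs.` count as words with their punctuation attached.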
Answer 0 (score: 3)
Use:
# split each text into words and stack them into one Series with a MultiIndex (row, position)
all_val = df['Text'].str.split(expand=True).stack()
# keep only words that occur more than once, then join them back per row (first level of the MultiIndex)
df['Text'] = all_val[all_val.duplicated(keep=False)].groupby(level=0).apply(' '.join)
print(df)
Id Text
0 1.0 Dear Mr. your order is delivered
1 2.0 Dear Mr. your ball order is delivered
2 3.0 Dear your ball is delivered
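One thing to watch with the `stack` approach: a row whose words are all unique drops out of the grouped result, so assigning it back leaves `NaN` in that row. A minimal sketch of one way to handle that, using made-up sample data and `reindex` to fill the gap with an empty string:

```python
import pandas as pd

# Hypothetical data: the third row shares no word with any other row.
df = pd.DataFrame({"Text": ["red apple", "red pear", "green plum"]})

# One word per entry, indexed by (row, position).
words = df["Text"].str.split(expand=True).stack()

# Keep words that occur more than once and join them back per row.
kept = words[words.duplicated(keep=False)].groupby(level=0).apply(" ".join)

# `kept` has no entry for row 2, so align it to the original index
# and use an empty string instead of NaN.
df["Text"] = kept.reindex(df.index, fill_value="")
print(df["Text"].tolist())
```

Whether an empty string or `NaN` is the better result for such rows depends on what happens to the column afterwards.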
Or:
# join all text together and split by whitespace
all_val = ' '.join(df['Text']).split()
# collect the words that occur exactly once
once = [x for x in all_val if all_val.count(x) == 1]
# remove them from each text with a nested list comprehension
df['Text'] = [' '.join([y for y in x.split() if y not in once]) for x in df['Text']]
# alternative with apply
#df['Text'] = df['Text'].apply(lambda x: ' '.join([y for y in x.split() if y not in once]))
print(df)
Id Text
0 1.0 Dear Mr. your order is delivered
1 2.0 Dear Mr. your ball order is delivered
2 3.0 Dear your ball is delivered
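On larger data, `all_val.count(x)` inside the comprehension rescans the full word list for every word. A sketch of the same idea with `collections.Counter`, which counts everything in one pass (`drop_singletons` is a helper name made up for this example, not part of pandas):

```python
from collections import Counter

import pandas as pd


def drop_singletons(texts):
    """Remove tokens that occur only once across all strings in `texts`."""
    texts = list(texts)
    # Single pass over every whitespace-separated token.
    counts = Counter(tok for text in texts for tok in text.split())
    # Rebuild each string from the tokens seen more than once.
    return [" ".join(tok for tok in text.split() if counts[tok] > 1)
            for text in texts]


df = pd.DataFrame({
    "Text": [
        "Dear Mr. John, your bag order is delivered",
        "Dear Mr. Brick, your ball order is delivered",
        "Dear Mrs. Blue, your ball purchase is delivered",
    ],
})
df["Text"] = drop_singletons(df["Text"])
print(df["Text"].tolist())
```

The result should be the same as above; only the counting step differs.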
Answer 1 (score: 1)
You can do:
In [77]: import re
In [78]: s = pd.Series(df.Text.str.cat(sep=' ').split()).value_counts()
In [79]: exp = '|'.join(map(re.escape, s[s.eq(1)].index))
In [80]: df.Text.str.replace(exp, '', regex=True).str.replace(r'\s\s+', ' ', regex=True)
Out[80]:
0 Dear Mr. your order is delivered
1 Dear Mr. your ball order is delivered
2 Dear your ball is delivered
Name: Text, dtype: object
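A caveat with the plain alternation pattern: it matches substrings, so a once-only word such as `bag` would also be cut out of `baggage`. A hedged sketch of a stricter variant that only matches whole whitespace-delimited tokens, using lookarounds. The sample data here is made up to show the difference:

```python
import re

import pandas as pd

# Hypothetical data: "bag" occurs once, "baggage" occurs twice.
df = pd.DataFrame({"Text": ["my bag is here", "my baggage is here", "baggage here"]})

counts = pd.Series(df["Text"].str.cat(sep=" ").split()).value_counts()
once = counts[counts.eq(1)].index

# Escape each token, and require whitespace (or string edge) on both sides
# so that only complete tokens are removed.
pattern = r"(?<!\S)(?:" + "|".join(map(re.escape, once)) + r")(?!\S)"

cleaned = (
    df["Text"]
    .str.replace(pattern, "", regex=True)   # drop once-only tokens
    .str.replace(r"\s+", " ", regex=True)   # collapse leftover runs of spaces
    .str.strip()                            # trim spaces left at the edges
)
print(cleaned.tolist())
```

Here `baggage` is left alone even though it contains `bag`; without the lookarounds it would be reduced to `gage`.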