Remove all words except those in a list

Time: 2019-03-13 09:58:35

Tags: python pandas

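I have a pandas DataFrame like the one below, which contains sentences, and a list named vocab. I want to remove every word from each sentence that is not in the vocab list.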

Sample df:

                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell

Sample vocab:

['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

Expected output:

                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell

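I first tried using .str.replace to remove all the important words (the ones in vocab) from the sentence and stored the result in t1. I then applied the same operation again, using t1 against the sentence, to get the expected output. But it does not work as expected.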

My attempt

import pandas as pd

vocab_lis=['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
vocab_regex = ' '+' | '.join(vocab_lis)+' '
df=pd.DataFrame()
s = pd.Series(["packag come differ what about tomorrow", "Hello dear truth is hard to tell"])
df['sentence']=s
df['sentence']= ' '+df['sentence']+' '

# regex=True is passed explicitly so the pattern is treated as a regular expression
df['t1'] = df['sentence'].str.replace(vocab_regex, ' ', regex=True)
df['t2'] = df.apply(lambda x: pd.Series(x['sentence']).str.replace(' | '.join(x['t1'].split()), ' ', regex=True), axis=1)

Is there a simple way to accomplish this? I know my code fails because of the spaces. How can I fix it?

2 answers:

Answer 0: (score: 2)

Use a nested list comprehension and split on whitespace:

df['res'] = [' '.join(y for y in x.split() if y in vocab_lis) for x in df['sentence']]
print (df)
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
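
As a minimal sketch of the same idea, the vocabulary can be converted to a set first, so each membership test does not have to scan the whole list. The names `vocab_set` and `keep_vocab_words` are hypothetical helpers introduced here for illustration; they are not part of the original answer.

```python
import pandas as pd

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
df = pd.DataFrame({'sentence': ["packag come differ what about tomorrow",
                                "Hello dear truth is hard to tell"]})

# Assumption: the vocabulary may be large, so a set is used for fast lookups.
vocab_set = set(vocab_lis)

def keep_vocab_words(sentence, vocab):
    """Return the sentence with only the words found in vocab, order preserved."""
    return ' '.join(word for word in sentence.split() if word in vocab)

# Same list-comprehension pattern as above, with the filter in a named helper.
df['res'] = [keep_vocab_words(x, vocab_set) for x in df['sentence']]
print(df)
```

For a vocabulary of seven words the difference is negligible; the set only starts to matter when the list grows to thousands of entries.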

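To fix the regex from the question, use \b word boundaries instead of padding with spaces. This removes the vocab words, as in the t1 step:

vocab_regex = '|'.join(r"\b{}\b".format(x) for x in vocab_lis)
df['t1'] = df['sentence'].str.replace(vocab_regex, '', regex=True)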
print (df)
                                 sentence                  t1
0  packag come differ what about tomorrow   come  what about 
1        Hello dear truth is hard to tell     Hello   is  to
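
The word-boundary pattern above strips the vocab words and leaves the rest. A possible variant, sketched here under the assumption that the vocab entries are plain tokens, uses the same kind of pattern with `re.findall` to keep only the vocab words and so produce the expected `res` value directly. The helper name `keep_vocab_words` is hypothetical, and `re.escape` is added as a precaution in case a vocab entry contains regex metacharacters.

```python
import re

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

# One alternation wrapped in word boundaries; each word is escaped so it is
# matched literally.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, vocab_lis))))

def keep_vocab_words(sentence):
    """Collect the vocab words in order of appearance and rejoin them."""
    return " ".join(pattern.findall(sentence))

sentences = ["packag come differ what about tomorrow",
             "Hello dear truth is hard to tell"]
results = [keep_vocab_words(s) for s in sentences]
print(results)
```

On a DataFrame this could be applied with something like `df['sentence'].map(keep_vocab_words)`. Note that `\b` treats punctuation as a boundary, so a token such as `hard-won` would still yield `hard`; the whitespace-split approach does not have that behaviour.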

Answer 1: (score: 2)

Using np.array

Data

                                   sentence
0    packag come differ what about tomorrow
1          Hello dear truth is hard to tell

Vocabulary

v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

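First split each sentence into a list of words, then use np.in1d to check which words also appear in the vocabulary list. Then simply join the remaining words back into a string. The result is assigned to a new res column so that it matches the output below:

data['res'] = data['sentence'].apply(lambda x: ' '.join(np.array(x.split(' '))[np.in1d(x.split(' '), v)]))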

Output

                                   sentence                     res
0    packag come differ what about tomorrow  packag differ tomorrow
1          Hello dear truth is hard to tell    dear truth hard tell
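
A self-contained sketch of the same masking approach follows, with two changes that are assumptions on my part rather than part of the original answer: each sentence is split only once, and `np.isin` is used in place of `np.in1d`, which is deprecated as of NumPy 2.0. The helper name `keep_vocab_words` is hypothetical.

```python
import numpy as np
import pandas as pd

v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
data = pd.DataFrame({'sentence': ["packag come differ what about tomorrow",
                                  "Hello dear truth is hard to tell"]})

def keep_vocab_words(sentence):
    """Keep only the words of the sentence that appear in v."""
    words = np.array(sentence.split())
    # Boolean mask: True where the word is in the vocabulary.
    mask = np.isin(words, v)
    return ' '.join(words[mask])

data['res'] = data['sentence'].apply(keep_vocab_words)
print(data)
```

For short sentences like these, building a NumPy array per row probably costs more than it saves, so the plain list comprehension from the other answer is likely the simpler choice.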