Question

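I have a data frame in which one of the variables is a long paragraph containing many sentences. Sometimes the sentences are separated by periods, sometimes by commas. I am trying to create new variables by extracting only the parts of the text that contain selected words. Below is a short example of the data frame showing my current result, along with the code I am using. Note: the text in the first variable is long.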

PhysicalMentalDemands           Physical_driving       Physical_telephones

[driving may be necessary       [driving......]        [telephones...]
occasionally. 
as well as telephones will also 
be occasional to frequent.]

Code used:

searched_words = ['driving', 'telephones']

for i in searched_words:
    Test['Physical' + '_' + str(i)] = Test['PhysicalMentalDemands'].apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if any(True for w in word_tokenize(sent)
                             if w.lower() in searched_words)])

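Problem:

Currently, my code extracts the sentences, but it matches against both words at once, so every new column receives the sentences for either word. I have seen other similar posts, but none of them solved my problem.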

Fixed

searched_words = ['driving', 'physical']

for i in searched_words:
    df['Physical' + '_' + i] = result['PhysicalMentalDemands'].str.lower().apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if i in word_tokenize(sent)])

Answer 1

If you want a separate column for each searched word, consider reorganizing the code as follows:

from nltk.tokenize import sent_tokenize, word_tokenize

searched_words = ['driving', 'telephones']

for searched_word in searched_words:
    # Compare each token with the current word only, not with the whole list.
    Test['Physical' + '_' + searched_word] = Test['PhysicalMentalDemands'].apply(
        lambda text: [sent for sent in sent_tokenize(text)
                      if any(w for w in word_tokenize(sent) if w.lower() == searched_word)])

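Note that the substance of this fix is changing the condition from if w.lower() in searched_words to if w.lower() == searched_word. With the original condition, a sentence was kept whenever it contained any of the searched words, so every new column ended up with the same sentences.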
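
To see the effect of the per-word comparison without a data frame, here is a minimal sketch that runs on the standard library alone. The helpers split_sentences, split_words and sentences_with_word are hypothetical names introduced for illustration; the two regex-based splitters are only rough stand-ins for NLTK's sent_tokenize and word_tokenize, not equivalents.

```python
import re


def split_sentences(text):
    """Rough stand-in for sent_tokenize: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Rough stand-in for word_tokenize: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def sentences_with_word(text, word):
    """Keep only the sentences that contain `word` as a whole token."""
    return [s for s in split_sentences(text) if word in split_words(s)]


# Sample text taken from the question's example row.
text = ("driving may be necessary occasionally. "
        "as well as telephones will also be occasional to frequent.")

searched_words = ["driving", "telephones"]

# One entry per searched word, mirroring the Physical_<word> columns.
columns = {"Physical_" + w: sentences_with_word(text, w) for w in searched_words}

for name, sentences in columns.items():
    print(name, sentences)
```

With a real data frame, the same idea would presumably be applied per row, for example by passing a function like sentences_with_word to Test['PhysicalMentalDemands'].apply inside the loop over searched_words.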

Python: creating a new variable derived from sentences extracted from text
