Question

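I am working on a text classification project in which I need to split a sentence into words so that I can compute the probability that it is positive or negative. The problem is the word "not": whenever it appears, it turns an otherwise positive sentence into a negative one, but my system still classifies the sentence as positive, which makes the result wrong.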

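My idea is to find a way to split the sentence into words, except that "not" stays attached to the word that follows it.

For example, given "she is not beautiful",

instead of getting "she", "is", "not", "beautiful"

I want to get "she", "is", "not beautiful"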

Answer 1

You can use re.split with a negative lookbehind for "not".

import re
mystr = "she is not beautiful"
# Raw string, so that \s reaches the regex engine as written
re.split(r"(?<!not)\s", mystr)
# ['she', 'is', 'not beautiful']

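The regular expression pattern is:

(?<!not): a negative lookbehind for "not", so a whitespace character directly preceded by "not" is not used as a split point
\s: any whitespace character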
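
One caveat with this pattern is that the lookbehind only checks the three characters before the whitespace, so it also refuses to split after words that merely end in "not", such as "cannot". The sketch below is one possible refinement and not part of the original answer: it adds a \b word boundary inside the lookbehind, ignores case, and collapses repeated whitespace before splitting. The names NOT_AWARE_SPACE and split_keep_not are made up for illustration.

```python
import re

# Hypothetical names. The pattern is the one from the answer plus a \b word
# boundary, so only the standalone word "not" blocks a split.
NOT_AWARE_SPACE = re.compile(r"(?<!\bnot)\s", re.IGNORECASE)

def split_keep_not(sentence):
    # Collapse runs of whitespace so every gap is exactly one space;
    # this also drops leading and trailing whitespace.
    normalized = " ".join(sentence.split())
    return NOT_AWARE_SPACE.split(normalized)

print(split_keep_not("she is not beautiful"))  # ['she', 'is', 'not beautiful']
print(split_keep_not("I cannot swim"))         # ['I', 'cannot', 'swim']
```

With the unmodified pattern, the second call would return ['I', 'cannot swim'] instead.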

Answer 2

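You can also try the following:

1. Split the text on 'not'.
2. Take the first element of the resulting list, split it into words, and add those words to a second list that will be returned.
3. For each of the other elements of the list from step 1, split it into words and prepend 'not' to its first word.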

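def my_seperator(text):
    # Split on the substring 'not'; note that this also cuts inside words such as 'nothing'.
    parts = text.strip().split('not')
    # Words before the first 'not' are kept as they are.
    my_text = parts[0].split()
    for part in parts[1:]:
        words = part.split()
        if words:
            # Attach 'not' to the word that follows it.
            my_text.append('not ' + words[0])
            my_text += words[1:]
        else:
            # 'not' was the last word, so there is nothing to attach it to.
            my_text.append('not')
    return my_text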

>>> my_seperator('she is not beautiful . but not that she is ugly. Maybe she is not my type')
['she', 'is', 'not beautiful', '.', 'but', 'not that', 'she', 'is', 'ugly.', 'Maybe', 'she', 'is', 'not my', 'type']
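
Because the function above splits on the substring 'not', it also cuts words that merely contain it ('nothing' becomes 'not hing'). As a hedged alternative that is not part of the original answer, the sketch below works on whole words instead: it splits on whitespace first and then glues each standalone "not" to the word after it. The name merge_not is hypothetical.

```python
def merge_not(words):
    """Join each standalone 'not' with the word that follows it."""
    merged = []
    attach = False  # True when the previous word was a standalone "not"
    for word in words:
        if attach:
            # Glue this word onto the pending "not" entry.
            merged[-1] = merged[-1] + " " + word
        else:
            merged.append(word)
        # Compare whole words only, so "nothing" or "cannot" never triggers.
        attach = word.lower() == "not"
    return merged

print(merge_not("nothing is not possible".split()))
# ['nothing', 'is', 'not possible']
```

A trailing "not" with no following word is simply kept as its own item.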

That said, a regular expression is the way to go, as @pault mentioned.

How to split a sentence into words with some exceptions
