Remove negation tokens and return the negated sentence in spaCy

Asked: 2019-03-03 15:53:42

Tags: python spacy

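I want to use spaCy's dependency parser to determine the scope of negation in a document. See here for the dependency visualizer applied to the following string: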

RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL

I am able to detect the negation cues with

 negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']

As a result, I see that not is a negation modifier of got in the string. Now I want to define the scope of the negation as follows:

negation_head_tokens = [token.head for token in negation_tokens]   
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
    print(negated_tokens)

This gives the following output:

 ooopen to Talk about patents with GOOG definitely not the treatment Samsung

Now that I have defined the scope, I want to prepend "not" to words that carry certain POS tags

list = ['ADJ', 'ADV', 'AUX', 'VERB']
for token in negated_tokens:
    for i in list:
        if token.pos_ == i:
            print('not'+token.text)

This gives the following:

 notooopen, notTalk, notdefinitely, notnot

I would like to exclude not from the output and return

RT @trader $AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely the treatment Samsung got heh someurl

How can I achieve this? Do you see any ways to make my script faster?

Full script:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')
list = ['ADJ', 'ADV', 'AUX', 'VERB']

negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
negation_head_tokens = [token.head for token in negation_tokens]

for token in negation_head_tokens:
   end = token.i
   start = token.head.i + 1
   negated_tokens = doc[start:end]
   for token in negated_tokens:
      for i in list:
         if token.pos_ == i:
            print('not'+token.text)

1 Answer:

Answer 0: (score: 1)

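  1. Shadowing Python built-ins such as list is bad form; I renamed it to pos_list.

  2. Since "not" is just a regular adverb, the simplest way to avoid it seems to be an explicit blacklist. There may be a more "linguistic" way to do this.

  3. I sped up your inner loop a bit.

Code: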

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')

pos_list = ['ADJ', 'ADV', 'AUX', 'VERB']
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
# texts of the negation cues, so that they never get the 'not' prefix
blacklist = [token.text for token in negation_tokens]
negation_head_tokens = [token.head for token in negation_tokens]
new_doc = []

# note: if there are several negations, only the last scope is kept
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
for token in doc:
    # or you can leave out the blacklist and test against
    # [token.text for token in negation_tokens] directly
    if token in negated_tokens and token.pos_ in pos_list and token.text not in blacklist:
        new_doc.append('not'+token.text)
    else:
        new_doc.append(token.text)
print(' '.join(new_doc))

> RT @trader $ AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely not the treatment Samsung got heh someURL
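
Note that the output above still contains the original not, whereas the question asks for it to be removed. One possible way to handle that, shown here only as a minimal sketch, is to identify the negation cues by their dep_ label and index instead of by text, skip them when rebuilding the sentence, and collect every negation scope rather than only the last one. The names negate_scope and FakeToken, as well as the hand-written POS and dependency labels in the demo, are assumptions made for illustration and are not spaCy output. The function only reads the .i, .text, .pos_, .dep_ and .head attributes, so a real spaCy Doc should be usable in place of the list.

```python
# Hypothetical helper, not part of spaCy: prefixes tokens inside each
# negation scope and drops the negation cue itself.
NEGATABLE_POS = frozenset({"ADJ", "ADV", "AUX", "VERB"})


def negate_scope(doc, pos_tags=NEGATABLE_POS, prefix="not"):
    """Rebuild the sentence with in-scope tokens prefixed by `prefix`.

    `doc` is any indexable sequence of token-like objects exposing
    .i, .text, .pos_, .dep_ and .head (the subset of the spaCy Token
    attributes used in the question).
    """
    neg_indices = {tok.i for tok in doc if tok.dep_ == "neg"}
    in_scope = set()
    for i in neg_indices:
        head = doc[i].head          # the word the negation modifies
        start = head.head.i + 1     # same scope rule as in the question
        end = head.i                # exclusive, as in doc[start:end]
        in_scope.update(range(start, end))

    out = []
    for tok in doc:
        if tok.i in neg_indices:
            continue                # drop the negation cue itself
        if tok.i in in_scope and tok.pos_ in pos_tags:
            out.append(prefix + tok.text)
        else:
            out.append(tok.text)
    return " ".join(out)


class FakeToken:
    """Stand-in for a spaCy Token so the demo runs without a model."""

    def __init__(self, i, text, pos_, dep_):
        self.i, self.text, self.pos_, self.dep_ = i, text, pos_, dep_
        self.head = self            # a root token is its own head


# Hand-annotated toy sentence; the labels are illustrative assumptions.
rows = [("it", "PRON", "nsubj"), ("is", "AUX", "ROOT"),
        ("open", "ADJ", "acomp"), ("definitely", "ADV", "advmod"),
        ("not", "PART", "neg"), ("what", "PRON", "dobj"),
        ("Samsung", "PROPN", "nsubj"), ("got", "VERB", "ccomp")]
tokens = [FakeToken(i, *row) for i, row in enumerate(rows)]
tokens[4].head = tokens[7]          # "not" modifies "got"
tokens[7].head = tokens[1]          # "got" attaches to "is"

result = negate_scope(tokens)
print(result)
```

With a real pipeline the call would be negate_scope(nlp(text)), and the result then depends on the parse the model produces. Because the sentence is rebuilt by joining token texts with spaces, tokenization effects such as "$ AAPL" in the output above would remain.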