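I want to use spaCy's dependency parser to determine the scope of negation in a document. See here for the dependency visualizer applied to the following string: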
RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL
I am able to detect the negation cues:
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
As a result, I find that not is a negation modifier of got in the string. Now I want to define the scope of the negation as follows:
negation_head_tokens = [token.head for token in negation_tokens]
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
    print(negated_tokens)
This gives the following output:
ooopen to Talk about patents with GOOG definitely not the treatment Samsung
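To make the slice arithmetic concrete, here is a minimal sketch in plain Python with the tokens written out by hand. It assumes the tokenization implied by the outputs in this post ($ and AAPL as separate tokens) and that the head of got is is, which is inferred from the span printed above rather than taken from the parser. The variable names are illustrative only.

```python
# Tokens written out by hand; this mirrors the tokenization implied by the
# outputs in this post and is not produced by spaCy here.
tokens = ["RT", "@trader", "$", "AAPL", "2012", "is", "ooopen", "to", "Talk",
          "about", "patents", "with", "GOOG", "definitely", "not", "the",
          "treatment", "Samsung", "got", "heh", "someURL"]

neg_head_i = tokens.index("got")     # plays the role of token.i (head of "not")
head_of_head_i = tokens.index("is")  # plays the role of token.head.i (assumed head of "got")

start = head_of_head_i + 1  # first token after the head's head
end = neg_head_i            # the head itself is excluded from the slice
scope = tokens[start:end]
print(" ".join(scope))
```

Note that the slice includes the cue not itself, which is why it later shows up as a candidate for prefixing.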
Now that I have defined the scope, I want to prepend "not" to words with certain POS tags:
list = ['ADJ', 'ADV', 'AUX', 'VERB']
for token in negated_tokens:
    for i in list:
        if token.pos_ == i:
            print('not'+token.text)
This gives the following:
notooopen, notTalk, notdefinitely, notnot
I want to exclude not from the output and return
RT @trader $AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely the treatment Samsung got heh someurl
How can I achieve this? Do you see any way to improve my script in terms of speed?
Full script:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')
list = ['ADJ', 'ADV', 'AUX', 'VERB']
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
negation_head_tokens = [token.head for token in negation_tokens]
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
    for token in negated_tokens:
        for i in list:
            if token.pos_ == i:
                print('not'+token.text)
Answer 0 (score: 1)
Overriding Python built-ins such as list is bad form, so I renamed it to pos_list.
Since "not" is just a regular adverb, the simplest way to avoid it seems to be an explicit blacklist. There may be a more "linguistic" way to do this.
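One possibly more "linguistic" variant is to skip the cue by its dependency label instead of its surface text. The sketch below uses a hypothetical Tok namedtuple as a stand-in for spaCy tokens (i, pos and dep mirror Token.i, Token.pos_ and Token.dep_), and the tags for everything except not are hand-assigned assumptions, not parser output.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; the tag values are hand-assigned
# for illustration and may differ from what a real model predicts.
Tok = namedtuple("Tok", ["i", "text", "pos", "dep"])

scope = [
    Tok(13, "definitely", "ADV", "advmod"),
    Tok(14, "not", "ADV", "neg"),
    Tok(15, "the", "DET", "det"),
    Tok(16, "treatment", "NOUN", "dobj"),
]
pos_list = ["ADJ", "ADV", "AUX", "VERB"]

# Exclude the negation cue by its dependency label rather than by its text,
# so no separate blacklist of strings is needed.
prefixed = ["not" + t.text for t in scope if t.pos in pos_list and t.dep != "neg"]
print(prefixed)
```

With real spaCy tokens the same condition would read token.pos_ in pos_list and token.dep_ != 'neg'.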
I sped up your inner loop a bit.
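The idea behind the speed-up is to replace the explicit loop over POS tags with a single membership test. A small sketch of the difference, using hand-tagged (text, pos) pairs as an assumption in place of real tokens, and a frozenset (pos_set, a name introduced here) for the lookup:

```python
# (text, pos) pairs tagged by hand for illustration, standing in for
# (token.text, token.pos_).
tagged = [("ooopen", "ADJ"), ("to", "PART"), ("Talk", "VERB"),
          ("about", "ADP"), ("patents", "NOUN")]

pos_list = ["ADJ", "ADV", "AUX", "VERB"]
pos_set = frozenset(pos_list)

# Original style: compare against every tag in an explicit inner loop.
nested = []
for text, pos in tagged:
    for p in pos_list:
        if pos == p:
            nested.append("not" + text)

# Membership test: one lookup per token, no inner loop.
flat = ["not" + text for text, pos in tagged if pos in pos_set]

print(flat)
```

With only four tags the gain is likely small; the main benefit is shorter, clearer code.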
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'RT @trader $AAPL 2012 is ooopen to Talk about patents with GOOG definitely not the treatment Samsung got heh someURL')
pos_list = ['ADJ', 'ADV', 'AUX', 'VERB']
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
# texts of the negation cues themselves, which must not get a "not" prefix
blacklist = [token.text for token in negation_tokens]
negation_head_tokens = [token.head for token in negation_tokens]
new_doc = []
for token in negation_head_tokens:
    end = token.i
    start = token.head.i + 1
    negated_tokens = doc[start:end]
for token in doc:
    if token in negated_tokens:
        # or you can leave out the blacklist and put it here directly:
        # if token.pos_ in pos_list and token.text not in [token.text for token in negation_tokens]:
        if token.pos_ in pos_list and token.text not in blacklist:
            new_doc.append('not'+token.text)
            continue
    new_doc.append(token.text)
print(' '.join(new_doc))
> RT @trader $ AAPL 2012 is notooopen to notTalk about patents with GOOG notdefinitely not the treatment Samsung got heh someURL
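The output above shows $ AAPL with a space because ' '.join puts a space between every pair of tokens. If the original spacing matters, each spaCy token carries its trailing whitespace in token.whitespace_, and joining on that should reproduce $AAPL. A sketch of the difference, using hand-written (text, trailing whitespace) pairs as stand-ins for (token.text, token.whitespace_) and a hypothetical to_negate set of positions:

```python
# (text, trailing_whitespace) pairs written by hand, standing in for
# (token.text, token.whitespace_) on the start of the example string.
pieces = [("RT", " "), ("@trader", " "), ("$", ""), ("AAPL", " "),
          ("2012", " "), ("is", " "), ("ooopen", "")]
to_negate = {6}  # hypothetical: positions selected by the scope/POS filter

# Joining with a fixed space separates "$" from "AAPL".
joined_with_spaces = " ".join(
    "not" + text if i in to_negate else text
    for i, (text, ws) in enumerate(pieces)
)

# Appending each token's own trailing whitespace keeps the original spacing.
joined_with_ws = "".join(
    ("not" + text if i in to_negate else text) + ws
    for i, (text, ws) in enumerate(pieces)
)

print(joined_with_spaces)
print(joined_with_ws)
```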