I am working on extracting the topics from a sentence, so that I can get the sentiment based on each topic. I am using nltk in python2.7 for this purpose. Take the following sentence as an example:
Donald Trump is the worst president of USA, but Hillary is better than him
We can see that Donald Trump and Hillary are the two topics; the sentiment related to Donald Trump is negative, while the sentiment related to Hillary is positive. So far, I have been able to chunk this sentence into noun phrases, and I get the following:
(S
(NP Donald/NNP Trump/NNP)
is/VBZ
(NP the/DT worst/JJS president/NN)
in/IN
(NP USA,/NNP)
but/CC
(NP Hillary/NNP)
is/VBZ
better/JJR
than/IN
(NP him/PRP))
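For reference, chunk output of this shape can be produced with nltk's RegexpParser; the exact grammar used above is not shown, so the grammar below is only an assumed approximation:

import nltk

sentence = "Donald Trump is the worst president of USA, but Hillary is better than him"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# Assumed grammar: optional determiner, then adjectives, then one or more nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")
print(chunker.parse(tagged))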
Now, how do I find the subjects from these noun phrases? And how do I group together the phrases belonging to each of the two subjects? Once I have the phrases for the two subjects separately, I can run sentiment analysis on each of them separately.
EDIT
I looked into the library (spacy) mentioned by @Krzysiek, and it also gives me the dependency tree of the sentence.
Here is the code:
from spacy.en import English
parser = English()
example = u"Donald Trump is the worst president of USA, but Hillary is better than him"
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])
Here is the dependency tree:
(u'Donald', u'compound', u'Trump', [], [])
(u'Trump', u'nsubj', u'is', [u'Donald'], [])
(u'is', u'ROOT', u'is', [u'Trump'], [u'president', u',', u'but', u'is'])
(u'the', u'det', u'president', [], [])
(u'worst', u'amod', u'president', [], [])
(u'president', u'attr', u'is', [u'the', u'worst'], [u'of'])
(u'of', u'prep', u'president', [], [u'USA'])
(u'USA', u'pobj', u'of', [], [])
(u',', u'punct', u'is', [], [])
(u'but', u'cc', u'is', [], [])
(u'Hillary', u'nsubj', u'is', [], [])
(u'is', u'conj', u'is', [u'Hillary'], [u'better'])
(u'better', u'acomp', u'is', [], [u'than'])
(u'than', u'prep', u'better', [], [u'him'])
(u'him', u'pobj', u'than', [], [])
This gives deep insight into the dependencies between the different tokens of the sentence. Here is the link to the article that describes the dependencies between the different pairs. How can I use this tree to attach the contextual words for the different subjects to those subjects?
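One rough way to use the tree, sketched here only as a starting point rather than a complete solution, is to take each nsubj token and collect the subtree of the verb it attaches to as that subject's context:

# Sketch: approximate each subject's context with the subtree of the verb
# it attaches to. Note this over-collects for the ROOT verb of a compound
# sentence (its subtree spans the whole sentence), which is exactly the
# difficulty addressed in the answers below.
for token in parsedEx:
    if token.dep_ == "nsubj":
        context = [t.orth_ for t in token.head.subtree]
        print(token.orth_, "->", " ".join(context))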
Answer 0 (score: 7)
I just solved a very similar problem recently - I needed to extract subject(s), action, and object(s). I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner
This is based on spacy (a competitor to nltk), but it also relies on the sentence tree.
So for instance, let's take this document embedded in spacy as an example:
import spacy

nlp = spacy.load("en")
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the "
          "Pacific and was already at war with the Republic of China "
          "in 1937, but the world war is generally said to have begun on "
          "1 September 1939 with the invasion of Poland by Germany and "
          "subsequent declarations of war on Germany by France and the United Kingdom. "
          "From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered "
          "or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. "
          "Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and "
          "annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. "
          "The war continued primarily between the European Axis powers and the coalition of the United Kingdom "
          "and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, "
          "the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the "
          "long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion "
          "of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part "
          "of the Axis' military forces into a war of attrition. In December 1941, Japan attacked "
          "the United States and European territories in the Pacific Ocean, and quickly conquered much of "
          "the Western Pacific.")

You can now create a simple pipes structure (more about pipes in the README of this project):

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   FindTokensPipe("VERB"),
                   AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("GPE"),
                                                         NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()]),
                            SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("LOC"),
                                                         NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()])])]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()

And as a result you will get:

>>>[([Germany], [conquered], [Europe]),
    ([Japan], [attacked], [the, United, States])]

Actually, it is based strongly on another library (for finding the pipes) - grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc

EDITED

Actually, the example I presented in the README discards adjectives, but all you need to do is adjust the pipes structure passed to the engine according to your needs. For instance, for your sample sentence I can propose such a structure/solution, which gives you a tuple of 3 elements (subj, verb, adj) for every sentence:

import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   AggregatePipe([FindTokensPipe("VERB"),
                                  FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                                  FindTokensPipe("VERB/xcomp/VERB")]),
                   AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                            AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                           FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                  ]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
It will give you the result:
[([Donald, Trump], [is], [the, worst])]
A slight complication is the fact that you have a compound sentence, and the lib produces one tuple per sentence - I will soon add the possibility (I need it for my project too) to pass a list of pipe structures to the engine, to allow producing more tuples per sentence. But for now you can solve it by creating a second engine for compound sents, whose structure will differ only in VERB/conj/VERB instead of VERB (these regexes always start from ROOT, so VERB/conj/VERB leads you to just the second verb in a compound sentence):

pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                                      NamedEntityFilterPipe(),
                                      NamedEntityExtractorPipe()]),
                        AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
                        AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                                 AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                                FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                       ]

engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])
And now, after you run both engines, you will get the expected results :)

This is what you need, I hope. Of course, I only quickly created a pipes structure for the given example sentence, and it won't work for every case; but I have seen a lot of sentence structures, and it already covers quite a nice percentage. You can then add more FindTokensPipe etc. for the cases that don't currently work, and I am sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complex, so... :)
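As a final illustration, here is a hedged sketch of that last step, assuming process() returns the list of extracted tuples, as the examples above suggest:

# Run both engines over the same doc and concatenate their tuples.
results = engine.process() + engine2.process()
# Illustrative expected output for the example sentence:
# [([Donald, Trump], [is], [the, worst]),
#  ([Hillary], [is], [better])]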
Answer 1 (score: 5)
I explored the spacy library some more, and I finally figured out a solution via the dependency parse. Thanks to this repo, I figured out how to include adjectives in my subject-verb-object extraction (making it SVAO), as well as how to take out compound subjects in the query. Here is my solution:
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]
ADJECTIVES = ["acomp", "advcl", "advmod", "amod", "appos", "nn", "nmod", "ccomp", "complm",
              "hmod", "infmod", "xcomp", "rcmod", "poss", "possessive"]
COMPOUNDS = ["compound"]
PREPOSITIONS = ["prep"]

def getSubsFromConjunctions(subs):
    # collect additional subjects conjoined with "and"
    moreSubs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(moreSubs) > 0:
                moreSubs.extend(getSubsFromConjunctions(moreSubs))
    return moreSubs

def getObjsFromConjunctions(objs):
    # collect additional objects conjoined with "and"
    moreObjs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(moreObjs) > 0:
                moreObjs.extend(getObjsFromConjunctions(moreObjs))
    return moreObjs

def getVerbsFromConjunctions(verbs):
    # collect additional verbs conjoined with "and"
    moreVerbs = []
    for verb in verbs:
        rightDeps = {tok.lower_ for tok in verb.rights}
        if "and" in rightDeps:
            moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
            if len(moreVerbs) > 0:
                moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
    return moreVerbs

def findSubs(tok):
    # climb the tree until we reach a verb or noun that can carry a subject
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        # was: tok.dep_ == "SUB", a label spacy never emits; use SUBJECTS instead
        subs = [tok for tok in head.lefts if tok.dep_ in SUBJECTS]
        if len(subs) > 0:
            verbNegated = isNegated(head)
            subs.extend(getSubsFromConjunctions(subs))
            return subs, verbNegated
        elif head.head != head:
            return findSubs(head)
    elif head.pos_ == "NOUN":
        return [head], isNegated(tok)
    return [], False

def isNegated(tok):
    negations = {"no", "not", "n't", "never", "none"}
    for dep in list(tok.lefts) + list(tok.rights):
        if dep.lower_ in negations:
            return True
    return False

def findSVs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs

def getObjsFromPrepositions(deps):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and dep.dep_ == "prep":
            objs.extend([tok for tok in dep.rights if tok.dep_ in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
    return objs

def getAdjectives(toks):
    toks_with_adjectives = []
    for tok in toks:
        adjs = [left for left in tok.lefts if left.dep_ in ADJECTIVES]
        adjs.append(tok)
        # was: tok.dep_ in ADJECTIVES, which filtered on the wrong token
        adjs.extend([right for right in tok.rights if right.dep_ in ADJECTIVES])
        toks_with_adjectives.extend(adjs)
    return toks_with_adjectives

def getObjsFromAttrs(deps):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(getObjsFromPrepositions(rights))
                    if len(objs) > 0:
                        return v, objs
    return None, None

def getObjFromXComp(deps):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(getObjsFromPrepositions(rights))
            if len(objs) > 0:
                return v, objs
    return None, None

def getAllSubs(v):
    verbNegated = isNegated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(getSubsFromConjunctions(subs))
    else:
        foundSubs, verbNegated = findSubs(v)
        subs.extend(foundSubs)
    return subs, verbNegated

def getAllObjs(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    objs.extend(getObjsFromPrepositions(rights))
    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def getAllObjsWithAdjectives(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    if len(objs) == 0:
        # no plain object found; fall back to adjectival complements (e.g. "is better")
        objs = [tok for tok in rights if tok.dep_ in ADJECTIVES]
    objs.extend(getObjsFromPrepositions(rights))
    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def findSVOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjs(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
    return svos

def findSVAOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjsWithAdjectives(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    obj_desc_tokens = generate_left_right_adjectives(obj)
                    sub_compound = generate_sub_compound(sub)
                    svos.append((" ".join(tok.lower_ for tok in sub_compound),
                                 "!" + v.lower_ if verbNegated or objNegated else v.lower_,
                                 " ".join(tok.lower_ for tok in obj_desc_tokens)))
    return svos

def generate_sub_compound(sub):
    # gather compound tokens on both sides of the subject ("Donald" + "Trump")
    sub_compounds = []
    for tok in sub.lefts:
        if tok.dep_ in COMPOUNDS:
            sub_compounds.extend(generate_sub_compound(tok))
    sub_compounds.append(sub)
    for tok in sub.rights:
        if tok.dep_ in COMPOUNDS:
            sub_compounds.extend(generate_sub_compound(tok))
    return sub_compounds

def generate_left_right_adjectives(obj):
    # gather adjective modifiers on both sides of the object ("worst" + "president")
    obj_desc_tokens = []
    for tok in obj.lefts:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))
    obj_desc_tokens.append(obj)
    for tok in obj.rights:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))
    return obj_desc_tokens
Now, when you pass a query such as:
import spacy

# Load a full English pipeline: a bare English() from spacy.lang.en has no
# tagger or parser, so the dependency-based extraction above would find nothing.
nlp = spacy.load("en")

sentence = u"""
Donald Trump is the worst president of USA, but Hillary is better than him
"""
parse = nlp(sentence)
print(findSVAOs(parse))
You will get the following:
[(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]
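Since the original goal was sentiment per topic, one possible next step (sketched here as an assumption, not part of the solution above) is to score each extracted triple separately, for example with NLTK's VADER analyzer:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires: nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()
for sub, verb, desc in findSVAOs(parse):
    # findSVAOs prefixes negated verbs with "!", so strip it before scoring
    phrase = " ".join([sub, verb.lstrip("!"), desc])
    print(sub, analyzer.polarity_scores(phrase)["compound"])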
Thanks @Krzysiek for your solution as well. I actually was unable to dig deep enough into your library to modify it, so I tried modifying the code from the link mentioned above to solve my problem instead.