I want to extract compound noun-adjective pairs from sentences. Basically, I want something like this:

Asked: 2018-07-12 14:38:20

Tags: python nltk spacy

For adjectives:

"The company's customer service was terrible."
{customer service, terrible}

For verbs:

"They kept increasing my phone bill"
{phone bill, increasing}

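This is a follow-up question to this posting.

However, I am trying to use spaCy to find the adjectives and verbs that correspond to multi-token phrases / compound nouns (such as "customer service").

I'm not sure how to do this with spaCy, NLTK, or any other prepackaged natural language processing software, and I would appreciate any help!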

1 answer:

Answer 0: (score: 4)

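For simple examples like these, you can combine spaCy's dependency parsing with a few simple rules.

First, to identify multi-word nouns like the ones in your examples, you can use the "compound" dependency. After parsing a document (e.g. a sentence) with spaCy, use a token's dep_ attribute to find its dependency label.

For example, this sentence has two compound nouns:

"The compound dependency identifies compound nouns."

Each token and its dependency label are shown below: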

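import spacy

# The original answer used the 'en' shortcut, which spaCy 3 no longer accepts;
# 'en_core_web_sm' is the full name of the small English model.
nlp = spacy.load('en_core_web_sm')

example_doc = nlp("The compound dependency identifies compound nouns.")
for tok in example_doc:
    print(tok.i, tok, "[", tok.dep_, "]")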

>>>0 The [ det ]
>>>1 compound [ compound ]
>>>2 dependency [ nsubj ]
>>>3 identifies [ ROOT ]
>>>4 compound [ compound ]
>>>5 nouns [ dobj ]
>>>6 . [ punct ]

compounds = [tok for tok in example_doc if tok.dep_ == 'compound']  # Get list of compounds in doc
for tok in compounds:
    # Slice from the compound modifier through its head noun
    noun = example_doc[tok.i: tok.head.i + 1]
    print(noun)
>>>compound dependency
>>>compound nouns
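
To see what the slice is doing without loading a model, here is a minimal spaCy-free sketch. The `ROWS` table and the `compound_spans` helper are hypothetical stand-ins, not part of spaCy: the dependency labels and the heads of the two compound tokens are copied from the output above, while the other head indices are assumed for illustration.

```python
# Toy stand-in for a parsed Doc: (index, text, dep label, head index).
# Dep labels and the heads of the two "compound" tokens follow the output
# shown above; the remaining head indices are assumptions.
ROWS = [
    (0, "The", "det", 2),
    (1, "compound", "compound", 2),
    (2, "dependency", "nsubj", 3),
    (3, "identifies", "ROOT", 3),
    (4, "compound", "compound", 5),
    (5, "nouns", "dobj", 3),
    (6, ".", "punct", 3),
]


def compound_spans(rows):
    """Return the text from each compound modifier through its head noun."""
    words = [text for _, text, _, _ in rows]
    spans = []
    for i, _, dep, head in rows:
        if dep == "compound":
            # Mirrors doc[tok.i: tok.head.i + 1] on a real spaCy Doc
            spans.append(" ".join(words[i:head + 1]))
    return spans


print(compound_spans(ROWS))
```

The slice assumes the head noun comes after its modifier, which is the usual order for English compounds; if the head came first, the slice would be empty.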

The following function works for your examples. However, it may not work for more complex sentences.

adj_doc = nlp("The company's customer service was terrible.")
verb_doc = nlp("They kept increasing my phone bill")

def get_compound_pairs(doc, verbose=False):
    """Return tuples of (multi-word noun, adjective or verb) for document."""
    # Get list of compound modifiers in doc
    compounds = [tok for tok in doc if tok.dep_ == 'compound']
    # Keep only the first token of each compound noun (drop middle parts);
    # the c.i == 0 check avoids indexing before the start of the doc
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound']
    tuple_list = []
    for tok in compounds:
        pair_item_2 = None  # stays None if no adjective or verb is found
        noun = doc[tok.i: tok.head.i + 1]
        pair_item_1 = noun
        # If noun is in the subject, we may be looking for adjective in predicate
        # In simple cases, this would mean that the noun shares a head with the adjective
        if noun.root.dep_ == 'nsubj':
            adj_list = [r for r in noun.root.head.rights if r.pos_ == 'ADJ']
            if adj_list:
                pair_item_2 = adj_list[0]
            if verbose:  # For trying different dependency tree parsing rules
                print("Noun: ", noun)
                print("Noun root: ", noun.root)
                print("Noun root head: ", noun.root.head)
                print("Noun root head rights: ", adj_list)
        # If noun is a direct object, take the nearest verb above it in the tree
        if noun.root.dep_ == 'dobj':
            verb_ancestor_list = [a for a in noun.root.ancestors if a.pos_ == 'VERB']
            if verb_ancestor_list:
                pair_item_2 = verb_ancestor_list[0]
            if verbose:  # For trying different dependency tree parsing rules
                print("Noun: ", noun)
                print("Noun root: ", noun.root)
                print("Noun root head: ", noun.root.head)
                print("Noun root head verb ancestors: ", verb_ancestor_list)
        if pair_item_1 and pair_item_2:
            tuple_list.append((pair_item_1, pair_item_2))
    return tuple_list

get_compound_pairs(adj_doc)
>>>[(customer service, terrible)]
get_compound_pairs(verb_doc)
>>>[(phone bill, increasing)]
get_compound_pairs(example_doc, verbose=True)
>>>Noun:  compound dependency
>>>Noun root:  dependency
>>>Noun root head:  identifies
>>>Noun root head rights:  []
>>>Noun:  compound nouns
>>>Noun root:  nouns
>>>Noun root head:  identifies
>>>Noun root head verb ancestors:  [identifies]
>>>[(compound nouns, identifies)]
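
The two rules inside get_compound_pairs (adjective among the right-hand children of the subject's head, nearest verb ancestor of the direct object) can also be traced by hand. Below is a rough spaCy-free sketch on hand-written parses of your two sentences. The `Tok` class and the `rights`, `ancestors`, and `pair_for` helpers are hypothetical stand-ins for the corresponding spaCy token attributes, and the dependency and part-of-speech labels are assumptions chosen to match the results shown above; a real model may tag some tokens differently.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    i: int
    text: str
    dep: str
    pos: str
    head: int  # index of the head token; the root points to itself


# Hand-written parse of "The company's customer service was terrible."
# (labels are assumptions, not output from a real model)
ADJ_SENT = [
    Tok(0, "The", "det", "DET", 1),
    Tok(1, "company", "poss", "NOUN", 4),
    Tok(2, "'s", "case", "PART", 1),
    Tok(3, "customer", "compound", "NOUN", 4),
    Tok(4, "service", "nsubj", "NOUN", 5),
    Tok(5, "was", "ROOT", "AUX", 5),
    Tok(6, "terrible", "acomp", "ADJ", 5),
    Tok(7, ".", "punct", "PUNCT", 5),
]

# Hand-written parse of "They kept increasing my phone bill"
VERB_SENT = [
    Tok(0, "They", "nsubj", "PRON", 1),
    Tok(1, "kept", "ROOT", "VERB", 1),
    Tok(2, "increasing", "xcomp", "VERB", 1),
    Tok(3, "my", "poss", "PRON", 5),
    Tok(4, "phone", "compound", "NOUN", 5),
    Tok(5, "bill", "dobj", "NOUN", 2),
]


def rights(sent, tok):
    """Children of tok that appear to its right (analogous to Token.rights)."""
    return [t for t in sent if t.head == tok.i and t.i > tok.i]


def ancestors(sent, tok):
    """Heads of tok, nearest first, up to the root (analogous to Token.ancestors)."""
    chain = []
    while tok.head != tok.i:
        tok = sent[tok.head]
        chain.append(tok)
    return chain


def pair_for(sent, noun_root):
    """Apply the subject/adjective and object/verb rules to one head noun."""
    if noun_root.dep == "nsubj":
        adjs = [t for t in rights(sent, sent[noun_root.head]) if t.pos == "ADJ"]
        return adjs[0].text if adjs else None
    if noun_root.dep == "dobj":
        verbs = [t for t in ancestors(sent, noun_root) if t.pos == "VERB"]
        return verbs[0].text if verbs else None
    return None


print(pair_for(ADJ_SENT, ADJ_SENT[4]))    # head noun "service"
print(pair_for(VERB_SENT, VERB_SENT[5]))  # head noun "bill"
```

Because the object rule takes the first verb ancestor, "bill" pairs with "increasing" rather than "kept", which matches the result shown above for verb_doc.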