Question

The input text is always a list of dish names, each with 1 to 3 adjectives and one noun.

Input:

thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Output:

thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

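Basically, I am looking to parse the sentence tree and generate bigrams by pairing each adjective with the noun.

I would like to achieve this with spaCy or NLTK.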

Answer 1

I used spaCy 2.0 with the English model. I parse the input to find the nouns and the "not-nouns", then pair each not-noun with the noun to build the desired output.

Your input:

s = ["thai iced tea",
"spicy fried chicken",
"sweet chili pork",
"thai chicken curry",]

spaCy solution:

import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'

def noun_notnoun(phrase):
    doc = nlp(phrase)  # run the tagger on the phrase
    token_not_noun = []
    notnoun_noun_list = []
    noun = None  # stays None if the tagger finds no noun

    for item in doc:
        if item.pos_ == "NOUN":
            noun = item.text  # if several nouns are found, the last one wins
        else:
            token_not_noun.append(item.text)

    if noun is None:
        return notnoun_noun_list  # nothing to pair with

    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list

Calling the function:

for phrase in s:
    print(noun_notnoun(phrase))

Result:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']
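
Note that the last line differs from the output requested in the question ('thai chicken, chicken curry, thai curry'). One possible way to get closer, sketched below, is to pair every earlier token with every later noun. This is only an illustration: the function name `modifier_noun_pairs` is hypothetical, and the tags are written by hand rather than produced by spaCy, so a real tagger may label the words differently.

```python
from itertools import combinations

def modifier_noun_pairs(tagged):
    """Pair every earlier token with every later noun, keeping word order."""
    return [
        f"{left} {right}"
        for (left, _), (right, right_pos) in combinations(tagged, 2)
        if right_pos == "NOUN"
    ]

# Hand-written (token, tag) pairs for illustration; not real tagger output.
tagged_phrase = [("thai", "ADJ"), ("chicken", "NOUN"), ("curry", "NOUN")]
print(modifier_noun_pairs(tagged_phrase))
```

With these assumed tags, "thai iced tea" would yield only 'thai tea' and 'iced tea', because 'iced' is not a noun and so never appears on the right-hand side of a pair.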

Answer 2

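You can achieve this in a few steps with NLTK:

1. PoS-tag the sequences.
2. Generate the desired n-grams (your examples contain no trigrams, but they do contain skip-grams, which can be generated from trigrams by punching out the middle token).
3. Discard all n-grams that don't match the pattern JJ NN.

Example: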

import nltk

def jjnn_pairs(phrase):
    '''
    Iterate over pairs of JJ-NN.
    '''
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    for ngram in ngramise(tagged):
        tokens, tags = zip(*ngram)
        if tags == ('JJ', 'NN'):
            yield tokens

def ngramise(sequence):
    '''
    Iterate over bigrams and 1,2-skip-grams.
    '''
    for bigram in nltk.ngrams(sequence, 2):
        yield bigram
    for trigram in nltk.ngrams(sequence, 3):
        yield trigram[0], trigram[2]

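Adapt the pattern ('JJ', 'NN') and the desired n-grams to your needs.

I think there is no need for parsing. The major problem with this approach, however, is that most PoS taggers will probably not tag everything exactly the way you want. For example, the default PoS tagger of my NLTK installation tagged "chili" as NN, not JJ, and "fried" as VBD. Parsing won't help you with that, though!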
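
To make the tagging problem concrete, here is a minimal sketch of widening the accepted patterns so that a participle tagged VBD still pairs with the noun. It uses plain Python over hand-written tags so that it runs without NLTK's tagger models. The names `ALLOWED_PATTERNS`, `skip_bigrams` and `modifier_pairs` are hypothetical, and the choice of which tag pairs to accept is an assumption you would tune on your own data.

```python
# Tag pairs accepted as "modifier + noun". This set is an assumption,
# chosen to cover the mis-taggings mentioned above.
ALLOWED_PATTERNS = {("JJ", "NN"), ("NN", "NN"), ("VBD", "NN"), ("VBN", "NN")}

def skip_bigrams(sequence):
    """Yield adjacent pairs, then pairs that skip one middle item."""
    yield from zip(sequence, sequence[1:])
    yield from zip(sequence, sequence[2:])

def modifier_pairs(tagged, patterns=ALLOWED_PATTERNS):
    """Yield 'modifier noun' strings whose tag pair is in `patterns`."""
    for ngram in skip_bigrams(tagged):
        tokens, tags = zip(*ngram)
        if tags in patterns:
            yield " ".join(tokens)

# Hand-written tags mimicking the VBD tagging of "fried" described above.
tagged = [("spicy", "JJ"), ("fried", "VBD"), ("chicken", "NN")]
print(list(modifier_pairs(tagged)))
```

Accepting ('NN', 'NN') has a cost: if "chili" is tagged NN, then "sweet chili" also matches ('JJ', 'NN'), so a looser pattern set may produce pairs you did not ask for.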

Answer 3

Something like this:

>>> from nltk import bigrams
>>> text = """thai iced tea
... spicy fried chicken
... sweet chili pork
... thai chicken curry"""
>>> lines = map(str.split, text.split('\n'))
>>> for line in lines:
...     ", ".join([" ".join(bi) for bi in bigrams(line)])
... 
'thai iced, iced tea'
'spicy fried, fried chicken'
'sweet chili, chili pork'
'thai chicken, chicken curry'
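
The same sliding-window idea extends to trigrams. As a rough sketch in plain Python (the helper name `ngrams_of` is hypothetical and not part of NLTK), n-grams of any length can be built by zipping shifted copies of the token list:

```python
def ngrams_of(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "thai chicken curry".split()
print(ngrams_of(tokens, 2))  # adjacent pairs
print(ngrams_of(tokens, 3))  # the single three-word run
```

Like the bigram version above, this does no tagging, so it cannot tell modifiers from nouns.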

Or use colibricore: https://proycon.github.io/colibri-core/doc/#installation ;P

How to generate bi/tri-grams using spaCy/NLTK
