What is an efficient way to extract key phrases from a given sentence using a TF-IDF scheme?

Asked: 2018-12-01 01:30:57

Tags: python nlp nltk text-extraction

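I am trying to extract key phrases from a given sentence using a TF-IDF scheme. To do that, I first look for candidate words or candidate phrases in the sentence, and then pick out the most frequent ones. However, when I introduced a new CFG rule to find possible key phrases in the sentence, I ran into an error.

Here is my script: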

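import re
import string
from itertools import chain, groupby

from nltk import RegexpParser, pos_tag, word_tokenize
from nltk.corpus import stopwords

rm_punct=re.compile('[{}]'.format(re.escape(string.punctuation)))
stop_words=set(stopwords.words('english'))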

def get_cand_words(sent, cand_type='word', remove_punct=False):
    candidates=list()
    sent=rm_punct.sub(' ', sent)
    tokenized=word_tokenize(sent)
    tagged_words=pos_tag(tokenized)
    if cand_type=='word':
        pos_tag_patt=tags = set(['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'])
        tagged_words=chain.from_iterable(tagged_words)
        for word, tag in enumerate(tagged_words):
            if tag in pos_tag_patt and word not in stop_words:
                candidates.append(word)

    elif cand_type == 'phrase':
        grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
        chunker = RegexpParser(grammar)
        all_tag = chain.from_iterable([chunker.parse(tag) for tag in tagged_words])
        for key, group in groupby(all_tag, lambda tag: tag[2] != 'O'):
            candidate = ' '.join([word for (word, pos, chunk) in group])
            if key is True and candidate not in stop_words:
                candidates.append(candidate)
    else:
        print("return word or phrase as target phrase")
    return candidates

This is the error Python raises:

sentence_1="Hillary Clinton agrees with John McCain by voting to give George Bush the benefit of the doubt on Iran."

sentence_2="The United States has the highest corporate tax rate in the free world"

get_cand_words(sent=sentence_1, cand_type='phrase', remove_punct=False)

ValueError: chunk structures must contain tagged tokens or trees

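The code above was inspired by an approach for extracting key phrases from long text passages. My goal is to find the unique key phrases in a given sentence, but the implementation above does not work well.

How can I fix this ValueError? How can I make the implementation above extract key phrases from a given sentence or from a list of sentences? Is there a better way to do this? Any other ideas are welcome. Thanks.

Goal

I want to find the most relevant noun-adjective phrases, or compound noun-adjective phrases, in a given sentence. How can this be done in Python? Thanks in advance.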

1 answer:

Answer 0 (score: -2)
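
For context, the ValueError in the question seems to come from the line that calls chunker.parse once per (word, tag) pair. RegexpParser.parse expects the whole tagged sentence, that is, a list of (word, tag) tuples; a single pair gets treated as a two-item sentence made of plain strings, which the chunker rejects. Below is a minimal sketch of the difference. The tags are hand-assigned (an assumption, so the snippet needs no tagger model; pos_tag may label some tokens differently), and the names tagged_sentence, single_pair_error and phrases are only illustrative.

```python
from itertools import groupby

from nltk import RegexpParser
from nltk.chunk import tree2conlltags

grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
chunker = RegexpParser(grammar)

# Hand-assigned tags for sentence_2 (an assumption, not real pos_tag output).
tagged_sentence = [
    ('The', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('has', 'VBZ'),
    ('the', 'DT'), ('highest', 'JJS'), ('corporate', 'JJ'), ('tax', 'NN'),
    ('rate', 'NN'), ('in', 'IN'), ('the', 'DT'), ('free', 'JJ'), ('world', 'NN'),
]

# Parsing a single (word, tag) pair, as the loop in the question does, is rejected.
try:
    chunker.parse(tagged_sentence[0])
    single_pair_error = None
except ValueError as err:
    single_pair_error = str(err)

# Parsing the whole tagged sentence works; flatten the tree to (word, pos, IOB-tag) triples.
iob_triples = tree2conlltags(chunker.parse(tagged_sentence))

# Consecutive tokens whose IOB tag is not 'O' belong to one chunk.
phrases = [
    ' '.join(word for word, pos, chunk in group)
    for is_chunk, group in groupby(iob_triples, lambda triple: triple[2] != 'O')
    if is_chunk
]

print(single_pair_error)
print(phrases)
```

Note that grouping on "not 'O'" would merge two chunks that sit directly next to each other; splitting on the B- prefix would avoid that if it matters for your data.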

Can you try this code?

import re
import string
from itertools import chain, groupby

import nltk
from nltk import RegexpParser, pos_tag, word_tokenize
from nltk.corpus import stopwords

rm_punct = re.compile('[{}]'.format(re.escape(string.punctuation)))
stop_words = set(stopwords.words('english'))

def get_cand_words(sent, cand_type='word', remove_punct=False):
    candidates = list()
    # Punctuation is always replaced by spaces here; remove_punct is not consulted.
    sent = rm_punct.sub(' ', sent)
    if cand_type == 'word':
        # Keep adjectives and nouns that are not stop words.
        pos_tag_patt = set(['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'])
        for word, tag in pos_tag(word_tokenize(sent)):
            if tag in pos_tag_patt and word.lower() not in stop_words:
                candidates.append(word)

    elif cand_type == 'phrase':
        grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
        chunker = RegexpParser(grammar)
        # Tag sentence by sentence so the chunker receives a full list of (word, tag) pairs.
        tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(text) for text in nltk.sent_tokenize(sent))
        # Flatten each chunk tree into (word, pos, IOB-tag) triples.
        all_tag = list(chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent)) for tagged_sent in tagged_sents))
        # Consecutive tokens whose IOB tag is not 'O' form one candidate phrase.
        for key, group in groupby(all_tag, lambda tag: tag[2] != 'O'):
            candidate = ' '.join([word for (word, pos, chunk) in group])
            if key is True and candidate not in stop_words:
                candidates.append(candidate)
    else:
        print("return word or phrase as target phrase")
    return candidates
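
The function above only returns candidates; it does not score them. Since the question asks about TF-IDF, here is a rough sketch of how candidate phrases from several sentences could be ranked. It is one possible scheme, not something taken from the question: rank_phrases is a hypothetical helper, each sentence is treated as a "document", the IDF uses a common smoothed form, and the candidate lists are hand-written stand-ins rather than real output of get_cand_words.

```python
import math
from collections import Counter


def rank_phrases(candidates_per_sentence):
    """Score the candidate phrases of each sentence with TF-IDF.

    candidates_per_sentence: list of lists of phrases, one inner list per sentence.
    Returns, per sentence, a list of (phrase, score) pairs, highest score first.
    """
    n_sents = len(candidates_per_sentence)

    # Document frequency: in how many sentences does each phrase occur?
    doc_freq = Counter()
    for phrases in candidates_per_sentence:
        doc_freq.update({p.lower() for p in phrases})

    ranked = []
    for phrases in candidates_per_sentence:
        counts = Counter(p.lower() for p in phrases)
        total = sum(counts.values())
        scores = {}
        for phrase, count in counts.items():
            tf = count / total
            # Smoothed IDF (an assumed variant): stays positive even for
            # phrases that occur in every sentence.
            idf = math.log((1 + n_sents) / (1 + doc_freq[phrase])) + 1
            scores[phrase] = tf * idf
        ranked.append(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
    return ranked


# Hand-written stand-ins for per-sentence candidate lists (illustrative only).
example_candidates = [
    ["tax rate", "free world", "tax rate"],
    ["tax rate", "benefit of the doubt"],
]
example_ranking = rank_phrases(example_candidates)
for ranking in example_ranking:
    print(ranking)
```

IDF only carries information when there is a collection of sentences to compare against; with a single sentence every phrase gets the same IDF and the ranking reduces to term frequency.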