How to match keywords in a paragraph using Python (nltk)

Asked: 2017-12-27 12:19:25

Tags: python machine-learning nltk

Keywords:

Keywords={u'secondary': [u'sales growth', u'next generation store', u'Steps Down', u' Profit warning', u'Store Of The Future', u'groceries']}

Paragraph:

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

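Is there a way to match the keywords in the paragraph (without using regular expressions)?

Output:

Matched keywords: next generation store, groceries

2 answers:

Answer 0: (score: 1)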

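There is no need to use NLTK for this. First, you have to either clean the text in the paragraph or change the values in the list under the 'secondary' key, because '"next generation" store' and 'next generation store' are two different strings.

After that, you can iterate over the values of 'secondary' and check whether each of those strings occurs in the text.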

match = [i for i in Keywords['secondary'] if i in paragraph]

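EDIT: As I pointed out above, '"next generation" store' and 'next generation store' are two different strings, which is why you only get 1 match. If the paragraph contained 'next generation store' without the inner quotes, you would get two matches, because both keywords would then actually be present.

INPUT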

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['groceries']

INPUT

paragraph="""HOUSTON -- Target has unveiled its first next generation store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['next generation store','groceries']
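The cleanup step described in this answer can be sketched as follows. This is only a minimal illustration under some assumptions: the helper name `match_keywords` is hypothetical, the paragraph is shortened, and matching is done case-insensitively on whole words by padding both strings with spaces (plain `in` would also match a keyword inside a longer word).

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and collapse runs of whitespace."""
    return ' '.join(text.lower().translate(PUNCT_TABLE).split())


def match_keywords(keywords, paragraph):
    """Hypothetical helper: return the keywords found in the paragraph."""
    # Pad with spaces so that only whole-word occurrences are matched.
    cleaned = ' ' + normalize(paragraph) + ' '
    matches = []
    for keyword in keywords:
        needle = normalize(keyword)
        if needle and ' ' + needle + ' ' in cleaned:
            matches.append(keyword)
    return matches


Keywords = {u'secondary': [u'sales growth', u'next generation store',
                           u'Steps Down', u' Profit warning',
                           u'Store Of The Future', u'groceries']}

# Shortened version of the paragraph from the question.
paragraph = ('HOUSTON -- Target has unveiled its first "next generation" '
             'store in the Houston area. Customers can buy grab-and-go '
             'items like groceries, wine and prepared meals.')

print(match_keywords(Keywords['secondary'], paragraph))
# ['next generation store', 'groceries']
```

Because the punctuation is removed from both sides before comparing, the quotes around "next generation" no longer prevent the match, and the original paragraph does not have to be edited by hand.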

Answer 1: (score: 0)

First, if your Keywords has only one key, you don't really need a dict. Use a set() instead.

Then make a small adjustment to the approach from "Find multi-word terms in a tokenized text in Python":
Keywords={u'secondary': [u'sales growth', u'next generation store', 
                         u'Steps Down', u' Profit warning', 
                         u'Store Of The Future', u'groceries']}

keywords = {u'sales growth', u'next generation store', 
            u'Steps Down', u' Profit warning', 
            u'Store Of The Future', u'groceries'}

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""


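import string

# word_tokenize needs NLTK's Punkt tokenizer data to be installed.
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# Normalize the keywords (lowercase, no stray spaces) so that entries such as
# u' Profit warning' or u'Steps Down' can match the lowercased tokens below.
normalized_keywords = {' '.join(k.lower().split()) for k in keywords}

# Register every keyword as a multi-word expression joined with '_'.
mwe = MWETokenizer([k.split() for k in normalized_keywords], separator='_')

# Clean out the punctuation in the paragraph.
puncts = set(string.punctuation)
cleaned_paragraph = ''.join(ch for ch in paragraph.lower() if ch not in puncts)

# Keep only the tokens that correspond to one of the keywords.
tokenized_paragraph = [token for token in mwe.tokenize(word_tokenize(cleaned_paragraph))
                       if token.replace('_', ' ') in normalized_keywords]

print(tokenized_paragraph)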
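For readers who want to see what the multi-word matching does without installing NLTK or its tokenizer data, here is a rough, dependency-free sketch of a similar idea. It is not how MWETokenizer is implemented; it is a swapped-in greedy longest-match scan over a whitespace-split token list, and the function name `find_terms` and the shortened example sentence are assumptions made for illustration.

```python
import string


def find_terms(tokens, terms):
    """Greedy longest-match scan for (multi-word) terms in a token list."""
    # Store each term as a tuple of lowercase words for fast lookup.
    term_tuples = {tuple(t.lower().split()) for t in terms}
    term_tuples.discard(())
    max_len = max((len(t) for t in term_tuples), default=0)

    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible n-gram starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in term_tuples:
                found.append(' '.join(candidate))
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts here, move on by one token
    return found


keywords = {u'sales growth', u'next generation store',
            u'Steps Down', u' Profit warning',
            u'Store Of The Future', u'groceries'}

# Shortened example sentence based on the paragraph from the question.
text = ('Target has unveiled its first "next generation" store, '
        'selling groceries and wine.')

# Same cleanup as above: lowercase, drop punctuation, split on whitespace.
table = str.maketrans('', '', string.punctuation)
tokens = text.lower().translate(table).split()

print(find_terms(tokens, keywords))
# ['next generation store', 'groceries']
```

Matching on tokens rather than on raw substrings means a keyword such as 'store' would not be reported just because the text contains 'stores'.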