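As my next step in learning more about NLP, I am trying to implement a simple heuristic that improves on the results of plain n-grams.
According to the Stanford collocations chapter linked below, passing candidate phrases through a part-of-speech filter that only lets through patterns likely to be "phrases" produces better results than simply using the most frequently occurring bigrams. Source: Collocations, pages 143-144: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
The table on page 144 lists 7 tag patterns. In order, the NLTK POS tag equivalents are: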
JJ NN
NN
JJ JJ NN
JJ NN NN
NN JJ NN
NN IN NN
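To make the filtering idea concrete, here is a minimal sketch in plain Python, not the chapter's implementation. It slides over hand-tagged tokens and keeps only the n-grams whose tag sequence appears in a small allow-list. The function name `candidate_phrases`, the helper `normalize`, the sample tokens, and the choice to fold all noun tags into NN are assumptions made for illustration; only a few of the patterns listed above are included.

```python
# A few of the tag patterns listed above (Penn Treebank tags, as used by NLTK).
TAG_PATTERNS = ["JJ NN", "JJ JJ NN", "JJ NN NN", "NN IN NN"]


def normalize(tag):
    # Assumption: treat every noun tag (NNS, NNP, NNPS) as plain NN.
    return "NN" if tag.startswith("NN") else tag


def candidate_phrases(tagged, patterns=TAG_PATTERNS):
    """Return the n-grams whose tag sequence is in the allow-list."""
    allowed = {tuple(p.split()) for p in patterns}
    phrases = []
    # Check each pattern length (here 2 and 3) with a sliding window.
    for n in sorted({len(p) for p in allowed}):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = tuple(normalize(tag) for _, tag in window)
            if tags in allowed:
                phrases.append(" ".join(word for word, _ in window))
    return phrases


# Hand-tagged sample, so no tagger is needed to run the sketch.
sample = [("the", "DT"), ("cumulative", "JJ"), ("distribution", "NN"),
          ("function", "NN"), ("of", "IN"), ("x", "NN")]
print(candidate_phrases(sample))
```

Note that this keeps overlapping candidates (for example both the two-word and the three-word phrase starting at "cumulative"), whereas a chunker produces non-overlapping chunks.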
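In the code below, I get the desired results when I apply each grammar independently. However, when I try to combine the same grammars, I do not get the desired results.
In my code, you can see that I uncomment one sentence, uncomment one grammar, run it, and check the result.
I should be able to combine all the sentences, run them through the combined grammar (only 3 rules in the code below), and get the desired results.
My question is: how do I combine the grammars correctly?
I assumed that combining grammars works like an 'OR': find this pattern, or that pattern...
Thanks in advance.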
import nltk
# The following sentences are correctly grouped with <JJ>*<NN>+.
# Should see: 'linear function', 'regression coefficient', 'Gaussian random variable' and
# 'cumulative distribution function'
SampleSentence = "In mathematics, the term linear function refers to two distinct, although related, notions"
#SampleSentence = "The regression coefficient is the slope of the line of the regression equation."
#SampleSentence = "In probability theory, Gaussian random variable is a very common continuous probability distribution."
#SampleSentence = "In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x."
# The following sentences are correctly grouped with <NN.?>*<V.*>*<NN>
# Should see 'mean squared error' and 'class probability function'.
#SampleSentence = "In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the difference between the estimator and what is estimated."
#SampleSentence = "The class probability function is interesting"
# The sentence below is correctly grouped with <NN.?>*<IN>*<NN.?>*.
# Should see 'degrees of freedom'.
#SampleSentence = "In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary."
SampleSentence = SampleSentence.lower()
print("\nFull sentence: ", SampleSentence, "\n")
tokens = nltk.word_tokenize(SampleSentence)
textTokens = nltk.Text(tokens)
# Determine the POS tags.
POStagList = nltk.pos_tag(textTokens)
# The following grammars work well *independently*
grammar = "NP: {<JJ>*<NN>+}"
#grammar = "NP: {<NN.?>*<V.*>*<NN>}"
#grammar = "NP: {<NN.?>*<IN>*<NN.?>*}"
# Merge several grammars above into a single one below.
# Note that all 3 correct grammars above are included below.
'''
grammar = """
NP:
{<JJ>*<NN>+}
{<NN.?>*<V.*>*<NN>}
{<NN.?>*<IN>*<NN.?>*}
"""
'''
cp = nltk.RegexpParser(grammar)
result = cp.parse(POStagList)
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP Subtree:", subtree)
Answer 0 (score: 0)
If my comment is what you were looking for, then here is the answer:
grammar = """
NP:
{<JJ>*<NN.?>*<V.|IN>*<NN.?>*}"""
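For background on why simply listing the rules under one label does not behave like an 'OR': `nltk.RegexpParser` applies the rules of a stage one after another, and a later rule can only chunk tokens that an earlier rule left unchunked. The sketch below illustrates this with two of the rules from the question, applied to hand-tagged tokens so that no tagger model is required. The sample tokens and the helper name `np_chunks` are assumptions made for illustration.

```python
import nltk

# Hand-tagged tokens (hypothetical sample, so pos_tag is not needed).
tagged = [("the", "DT"), ("degrees", "NNS"), ("of", "IN"),
          ("freedom", "NN"), ("matter", "VBP")]

# Two rules from the question, stacked in their original order.
stacked = """
NP:
    {<JJ>*<NN>+}
    {<NN.?>*<IN>*<NN.?>*}
"""

# The same two rules with the broader one first.
reordered = """
NP:
    {<NN.?>*<IN>*<NN.?>*}
    {<JJ>*<NN>+}
"""


def np_chunks(grammar, tagged_tokens):
    """Return the text of every NP chunk found by the grammar."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]


print(np_chunks(stacked, tagged))
print(np_chunks(reordered, tagged))
```

In the stacked order, the first rule should claim "freedom" on its own, so the second rule can only chunk "degrees of". In the reordered version the broader rule runs first and should capture "degrees of freedom" as one chunk. Merging everything into a single pattern, as in the grammar above, avoids the ordering question altogether.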