How to get the most common n-grams in Python or R

Date: 2018-03-31 16:27:52

Tags: python r nlp text-mining

Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen ways to get them for 3-grams or 2-grams, one n at a time, but is there a way to extract the most meaningful maximal-length phrases, along with all the others?

For example, in this text, used purely for demonstration: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.

The ideal result would be the n-grams and their counts:

fri evening commute: 3,
off-peak: 2,
rest of the words: 1

Any suggestions are appreciated. Thanks.

2 Answers:

Answer 0 (score: 1)

Python

Consider the NLTK library, which provides an ngrams function that you can use to iterate over values of n.

A rough implementation would look like the following, where rough is the keyword:

from nltk import ngrams
from collections import Counter

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Since you are not considering periods, and treat hyphenated words as phrases
sentence = sentence.replace('.', '').replace('-', ' ')

for n in range(len(sentence.split(' ')), 1, -1):
    phrases = []

    for token in ngrams(sentence.split(), n):
        phrases.append(' '.join(token))

    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        result.append((phrase, freq))  # store the count, not n, to match the print below
        sentence = sentence.replace(phrase, '')

for phrase, freq in result:
    print('%s: %d' % (phrase, freq))
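For comparison, a more direct sketch (not part of the original answer) that simply counts every n-gram for n = 1 to 6 with a single Counter, without the greedy removal step above. The function name ngram_counts and the max_n parameter are illustrative choices, and the normalization mirrors the rough one used above:

```python
from collections import Counter

def ngram_counts(text, max_n=6):
    """Count every n-gram for n = 1..max_n in a whitespace-tokenized text."""
    # Same rough normalization as above: drop periods, split hyphenated words.
    tokens = text.replace('.', '').replace('-', ' ').split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
counts = ngram_counts(text)
print(counts['fri evening commute'])  # 2
print(counts['off peak'])             # 2
```

This counts overlapping phrases independently, so shorter n-grams contained in a longer repeated phrase are also reported; filtering with counts.most_common() then gives the full frequency-sorted list.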

R

This might be helpful

Answer 1 (score: 1)