I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to read my files into the program.
The input is 300 .txt files written in English, and I want the output in the form of N-grams, specifically their frequency counts.
I know that NLTK has Bigram and Trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html, but I am not advanced enough to get them into my program.
Input: txt files (not single sentences)
Example output:
Bigram [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]
Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
My code so far is:
from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

for file in files.fileids():
    generate(file, ngrams)
What should I do next?
Answer 0 (score: 24)
Just use nltk.ngrams.
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. \
I need to write a program in NLTK that breaks a corpus"

token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

print(Counter(bigrams))
Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams',
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
(',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1,
('collection', 'of'): 1, ('files', ')'): 1})
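If you specifically want the frequency counts ranked, Counter.most_common gives you the top n-grams directly; a minimal sketch building on the snippet above:

from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

text = "I need to write a program in NLTK that breaks a corpus"
token = word_tokenize(text)
# most_common(k) returns the k most frequent n-grams with their counts
print(Counter(ngrams(token, 2)).most_common(3))
# e.g. [(('I', 'need'), 1), (('need', 'to'), 1), (('to', 'write'), 1)]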
UPDATE (using pure Python):
import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as f:
            corpus.append(f.read())

frequencies = Counter([])
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)
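To connect this back to the question's goal of writing one frequency file per input file and per n-gram order, here is a sketch; it reuses the PlaintextCorpusReader setup and the 'C:/Users/jack3/My folder' path from the question, and the output naming simply mirrors the print in the question's generate function:

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams
from collections import Counter

corpus_root = 'C:/Users/jack3/My folder'  # path taken from the question
files = PlaintextCorpusReader(corpus_root, r'.*\.txt')

for fileid in files.fileids():
    tokens = files.words(fileid)       # tokenized words of one txt file
    for n in range(1, 6):              # unigrams through fivegrams
        counts = Counter(ngrams(tokens, n))
        out_name = (fileid[:-4] + "_" + str(n) + "_grams.txt").replace("/", "_")
        with open(out_name, 'w', encoding='utf-8') as out:
            for gram, freq in counts.most_common():
                out.write(' '.join(gram) + '\t' + str(freq) + '\n')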
Answer 1 (score: 5)
If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do it like this:
from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j, el in enumerate(seq) if j >= i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams  # if joining in the generator: (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1, 2)):
    """Returns an iterator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
Usage:
>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
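Since the question ultimately wants frequency counts, the resulting iterator can be fed straight into a Counter; a minimal sketch using the functions above:

from collections import Counter

tokens = 'test the ngrams generator test the'.split()
counts = Counter(range_ngrams(tokens, ngram_range=(1, 3)))
print(counts.most_common(3))
# e.g. [(('test',), 2), (('the',), 2), (('test', 'the'), 2)]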
~ Same speed as NLTK:
import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reposted from a previous answer.
Answer 2 (score: 2)
OK, since you asked for an NLTK solution this might not be exactly what you were looking for, but: have you considered TextBlob? It has an NLTK backend, but its syntax is simpler. It would look something like this:
from textblob import TextBlob
text = "Paste your text or text-containing variable here"
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)
Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
Of course, you would still need to use Counter or some other method to add a count for each ngram.
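For instance (a sketch; blob.ngrams returns WordList objects, which are lists and therefore unhashable, so each one is converted to a tuple before counting):

from collections import Counter
from textblob import TextBlob

blob = TextBlob("Paste your text or text-containing variable here")
# convert each WordList (a list subclass) to a hashable tuple, then count
counts = Counter(tuple(ng) for ng in blob.ngrams(n=3))
print(counts.most_common(5))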
However, the fastest approach (by far) that I have found for both creating any ngrams you want and also counting them in a single function comes from this post from 2012 and uses itertools. It's great.
Answer 3 (score: 2)
Here is a simple example of generating any ngrams using pure Python:
>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'
>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]
>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]
>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
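Note that this generator yields list slices, and lists are unhashable, so to get frequency counts you would convert each n-gram to a tuple first; a small sketch in the same REPL style:

>>> from collections import Counter
>>> Counter(tuple(g) for g in ngrams(txt.split(), n=2)).most_common(2)
[(('Python', 'is'), 1), (('is', 'one'), 1)]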
Answer 4 (score: 0)
The answer from @hellpander above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably each time the frequencies are updated, because of the increasingly expensive dictionary lookups as the counter grows. So you need an extra buffer variable to cache the frequencies Counter from @hellpander's answer. Instead of doing a key lookup against the very large frequencies dict every time a new document is iterated, you add counts to a small, temporary Counter, and every so often merge it into the global frequencies. This is much faster, because the lookups against the huge dictionary happen far less often.
import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fh:
            corpus.append(fh.read())

frequencies = Counter([])
f = Counter([])  # small temporary buffer counter
for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    f += Counter(bigrams)
    if i % 10000 == 0:
        # merge the buffer into the global frequencies counter and clear it every 10000 docs
        frequencies += f
        f = Counter([])
frequencies += f  # flush whatever is left in the buffer
Answer 5 (score: -1)
Maybe this helps. See link