I have a large corpus built from a 161-row CSV (one row per county), loaded like so:
place_aggregated_listings[['titles', 'descriptions']].to_csv(r'./place_aggregated_listings.txt', header=None, index=None, sep=' ', mode='a' )
corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(root='./', fileids='place_aggregated_listings.txt')
I have a utils.py file with the following canonicalize_word() function:
def canonicalize_word(word, wordset=None, digits=True):
    word = word.lower()
    if digits:
        if (wordset is not None) and (word in wordset):
            return word
        word = canonicalize_digits(word)  # try to canonicalize numbers
    if (wordset is None) or (word in wordset):
        return word
    else:
        return u"<unk>"
From there, I build a vocab object, which takes over 1.5 hours to run on a large GCP instance:
# Collect counts of tokens and assign wordids.
vocab = vocabulary.Vocabulary(token_feed, progressbar=ProgressBar)
class Vocabulary(object):
    START_TOKEN = constants.START_TOKEN
    END_TOKEN = constants.END_TOKEN
    UNK_TOKEN = constants.UNK_TOKEN

    def __init__(self, tokens, size=None,
                 progressbar=lambda l: l):
        """Create a Vocabulary object.

        Args:
            tokens: iterator( string )
            size: None for unlimited, or int > 0 for a fixed-size vocab.
                  Vocabulary size includes special tokens <s>, </s>, and <unk>
            progressbar: (optional) progress bar to wrap iterator.
        """
        self.unigram_counts = Counter()
        self.bigram_counts = defaultdict(lambda: Counter())
        prev_word = None
        for word in progressbar(tokens):  # Make a single pass through tokens
            self.unigram_counts[word] += 1
            self.bigram_counts[prev_word][word] += 1
            prev_word = word
        self.bigram_counts.default_factory = None  # make into a normal dict

        # Leave space for "<s>", "</s>", and "<unk>"
        top_counts = self.unigram_counts.most_common(None if size is None else (size - 3))
        vocab = ([self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN] +
                 [w for w, c in top_counts])

        # Assign an id to each word, by frequency
        self.id_to_word = dict(enumerate(vocab))
        self.word_to_id = {v: k for k, v in self.id_to_word.items()}
        self.size = len(self.id_to_word)
        if size is not None:
            assert(self.size <= size)

        # For convenience
        self.wordset = set(self.word_to_id.keys())

        # Store special IDs
        self.START_ID = self.word_to_id[self.START_TOKEN]
        self.END_ID = self.word_to_id[self.END_TOKEN]
        self.UNK_ID = self.word_to_id[self.UNK_TOKEN]
I want to extract the uni-, bi-, and trigrams that appear in more than 25% of the 161 CSV rows. Right now this code only produces uni- and bigrams and does no filtering. What is the most efficient way to do this, and is there a way to improve the current uni/bigram extraction code? Thanks.
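To make the filtering requirement concrete, the result I'm after is roughly what sklearn's CountVectorizer gives with a relative min_df, sketched below. This ignores my canonicalize_word step, and I'm not sure it's the most efficient route, hence the question:

from sklearn.feature_extraction.text import CountVectorizer

# One document per county row: concatenate the two text columns.
docs = (place_aggregated_listings['titles'].fillna('') + ' ' +
        place_aggregated_listings['descriptions'].fillna(''))

# ngram_range=(1, 3) covers uni-, bi-, and trigrams in one pass;
# min_df=0.25 keeps only n-grams present in at least 25% of the 161 rows.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.25)
vectorizer.fit(docs)

frequent_ngrams = set(vectorizer.vocabulary_)  # surviving n-grams as strings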
Sample data:
[will post shortly]