Speeding up token counting when building a vocabulary from a corpus

Time: 2018-12-10 00:19:33

Tags: pandas nlp tokenize

I have a large corpus built from a 161-row CSV (one row per county), like so:

place_aggregated_listings[['titles', 'descriptions']].to_csv(r'./place_aggregated_listings.txt', header=None, index=None, sep=' ', mode='a')

corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(root='./', fileids='place_aggregated_listings.txt')

I have a utils.py file with the following canonicalize_word() function:

def canonicalize_word(word, wordset=None, digits=True):
    word = word.lower()
    if digits:
        if (wordset is not None) and (word in wordset):
            return word
        word = canonicalize_digits(word)  # try to canonicalize numbers
    if (wordset is None) or (word in wordset):
        return word
    else:
        return u"<unk>"

At that point I build a vocab object, which takes over 1.5 hours to run on a large GCP instance:

# Collect counts of tokens and assign wordids.
vocab = vocabulary.Vocabulary(token_feed, progressbar=ProgressBar)
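
token_feed is not defined in the question; presumably it streams every corpus token through canonicalize_word(), along the lines of this sketch (corpus and utils as above):

import utils

# Hypothetical definition: corpus.words() yields tokens lazily,
# and each one is canonicalized before counting.
token_feed = (utils.canonicalize_word(w) for w in corpus.words())

The Vocabulary class itself: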

from collections import Counter, defaultdict

class Vocabulary(object):

    START_TOKEN = constants.START_TOKEN
    END_TOKEN   = constants.END_TOKEN
    UNK_TOKEN   = constants.UNK_TOKEN

    def __init__(self, tokens, size=None,
                 progressbar=lambda l: l):
        """Create a Vocabulary object.

        Args:
            tokens: iterator( string )
            size: None for unlimited, or int > 0 for a fixed-size vocab.
                  Vocabulary size includes the special tokens <s>, </s>, and <unk>.
            progressbar: (optional) progress bar to wrap iterator.
        """
        self.unigram_counts = Counter()
        self.bigram_counts = defaultdict(Counter)
        prev_word = None
        for word in progressbar(tokens):  # make a single pass through tokens
            self.unigram_counts[word] += 1
            self.bigram_counts[prev_word][word] += 1
            prev_word = word
        self.bigram_counts.default_factory = None  # freeze into a plain dict

        # Leave space for "<s>", "</s>", and "<unk>"
        top_counts = self.unigram_counts.most_common(
            None if size is None else (size - 3))
        vocab = ([self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN] +
                 [w for w, c in top_counts])

        # Assign an id to each word, by frequency
        self.id_to_word = dict(enumerate(vocab))
        self.word_to_id = {v: k for k, v in self.id_to_word.items()}
        self.size = len(self.id_to_word)
        if size is not None:
            assert self.size <= size

        # For convenience
        self.wordset = set(self.word_to_id.keys())

        # Store special IDs
        self.START_ID = self.word_to_id[self.START_TOKEN]
        self.END_ID = self.word_to_id[self.END_TOKEN]
        self.UNK_ID = self.word_to_id[self.UNK_TOKEN]
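
On the 1.5-hour runtime mentioned above: the per-token self.unigram_counts[word] += 1 loop runs entirely in Python, and that, plus PlaintextCorpusReader's on-the-fly tokenization, is the likely bottleneck. If the token stream fits in memory, letting Counter consume whole iterables moves the counting loop into C. A minimal sketch of the same unigram/bigram pass (note it stores bigrams as a flat Counter keyed by (prev, word) pairs rather than the nested dict-of-Counters above; converting between the two is straightforward):

from collections import Counter

def count_unigrams_bigrams(tokens):
    """Count unigrams and bigrams in bulk rather than one token at a time."""
    tokens = list(tokens)                       # materialize the stream once
    unigrams = Counter(tokens)                  # C-level counting loop
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent (prev, word) pairs
    return unigrams, bigrams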

I want to use this code to get the uni-, bi-, and trigrams that appear in more than 25% of the 161 CSV rows. Right now it only extracts uni- and bigrams and does no filtering. What is the most efficient way to do this, and is there any way to improve the current uni/bigram extraction code? Thanks.
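
For the filtering part, the 25% threshold is a document-frequency cutoff, so each n-gram must be counted at most once per CSV row; that is easier to do against the DataFrame than against the concatenated text file. A minimal sketch, assuming place_aggregated_listings is still in scope and nltk.word_tokenize is an acceptable tokenizer (frequent_ngrams and min_doc_frac are hypothetical names):

from collections import Counter

import nltk
import utils

def frequent_ngrams(df, columns=('titles', 'descriptions'),
                    min_doc_frac=0.25):
    """Return the uni-, bi-, and trigrams occurring in more than
    min_doc_frac of the DataFrame's rows."""
    doc_freq = Counter()
    for _, row in df[list(columns)].iterrows():
        text = ' '.join(str(v) for v in row)
        tokens = [utils.canonicalize_word(w)
                  for w in nltk.word_tokenize(text)]
        # A set, so each n-gram counts at most once per row
        # (document frequency, not raw frequency).
        seen = set()
        for n in (1, 2, 3):
            seen.update(nltk.ngrams(tokens, n))
        doc_freq.update(seen)
    threshold = min_doc_frac * len(df)
    return {ng: count for ng, count in doc_freq.items() if count > threshold}

Calling frequent_ngrams(place_aggregated_listings) would then return the n-grams that clear the 25% cutoff over the 161 rows.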

Sample data:

[will be posted soon]

0 Answers:

No answers yet