Remove duplicate n-grams within a row in Python

Time: 2017-05-22 08:36:56

Tags: python n-gram

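This code generates n-grams and displays the count of each n-gram. I have a CSV file with rows and one column that contains a string of words for each row. When the code finds a 4-gram such as 'this is my puppy', it also counts how many times that 4-gram occurs within the same row. My intent is that an n-gram occurring in a row should be counted once for that row, counted a second time only when it appears in another row, and so on.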

e.g  row         Word
      1          this is my puppy what this is my puppy
      2          this is my puppy

So this code counts 'this is my puppy' three times, but I want it to be counted 2 times.
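To make the difference concrete, here is a minimal sketch that uses plain substring checks on the two example rows. The names rows and phrase are illustrative only. It contrasts the total number of occurrences with the number of rows that contain the phrase; it is not an n-gram counter, and substring matching would also match inside longer words.

```python
rows = [
    "this is my puppy what this is my puppy",
    "this is my puppy",
]
phrase = "this is my puppy"

# Every occurrence, including repeats inside one row (what I get now)
total_occurrences = sum(row.count(phrase) for row in rows)

# At most one count per row (what I want)
rows_containing = sum(1 for row in rows if phrase in row)

print(total_occurrences)  # 3
print(rows_containing)    # 2
```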

Here is the Python code:

import collections
import re
import sys
import time


def tokenize(string):
    """Convert string to lowercase and split into words (ignoring
    punctuation), returning list of words.
    """
    return re.findall(r'\w+', string.lower())


def count_ngrams(lines, min_length=4, max_length=5):
    """Iterate through given lines iterator (file object or list of
    lines) and return n-gram frequencies. The return value is a dict
    mapping the length of the n-gram to a collections.Counter
    object of n-gram tuple and number of times that n-gram occurred.
    Returned dict includes n-grams of length min_length to max_length.
    """
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length: 
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams


def print_most_frequent(ngrams, num=10):
    """Print num most common n-grams of each length in n-grams dict."""
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open("PWorm.csv") as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))

Any help would be greatly appreciated. Thank you.

1 answer:

Answer 0: (score: 0)

You can use a defaultdict for ngrams instead of populating the dict semi-manually.
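As a small illustration of that point, the sketch below compares the dict comprehension from the question with collections.defaultdict(collections.Counter). The variable names explicit and lazy are made up for this example. Both accumulate counts the same way; the difference is that the defaultdict only creates a Counter for a length once that length is first accessed.

```python
import collections

lengths = range(4, 6)

# The question's approach: one Counter per length, created up front
explicit = {length: collections.Counter() for length in lengths}

# The defaultdict approach: Counters are created on first access
lazy = collections.defaultdict(collections.Counter)

gram = ('this', 'is', 'my', 'puppy')
explicit[4][gram] += 1
lazy[4][gram] += 1

print(explicit[4] == lazy[4])  # True
print(sorted(explicit))        # [4, 5]
print(sorted(lazy))            # [4]
```

One consequence is that a length with no n-grams at all never appears as a key in the defaultdict, so print_most_frequent would print no header for it.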

To prevent the same n-gram from being counted twice within one line, you have to build an n-gram dict per line and then combine it with the overall n-gram dict.

def count_ngrams(lines, min_length=4, max_length=5):
    """Iterate through given lines iterator (file object or list of
    lines) and return n-gram frequencies. The return value is a dict
    mapping the length of the n-gram to a collections.Counter
    object of n-gram tuple and number of times that n-gram occurred.
    Returned dict includes n-grams of length min_length to max_length.
    """
    lengths = range(min_length, max_length + 1)
    ngrams = collections.defaultdict(collections.Counter)
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue(ngrams_line):
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length: 
                ngrams_line[length][current[:length]] = 1  # instead of += 1

    # Merge the per-line defaultdict(Counter) into the overall one
    def combine_ngrams(ngrams, ngrams_line):
        for k, v in ngrams_line.items():
            ngrams[k] += v
        return ngrams

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        ngrams_line = collections.defaultdict(collections.Counter)
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue(ngrams_line)
        ngrams = combine_ngrams(ngrams, ngrams_line)


    # Make sure we get the n-grams at the tail end of the queue
    ngrams_line = collections.defaultdict(collections.Counter)
    while len(queue) > min_length:
        queue.popleft()
        add_queue(ngrams_line)
    ngrams = combine_ngrams(ngrams, ngrams_line)

    return ngrams

I don't 100% understand the part after while len(queue) > min_length:, or why queue is not reset for every line, so you may need to tweak my answer slightly.
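On the queue question: because the queue is shared across lines, n-grams can span a line boundary and end up attributed to the following line. If that is not wanted, one possible tweak is to drop the queue and collect each line's n-grams in a set before adding them to the totals. The following is only a sketch under that assumption. The function name count_ngrams_per_line is hypothetical, each line is treated as raw text (no CSV column parsing), and n-grams that cross line boundaries are no longer counted.

```python
import collections
import re


def tokenize(string):
    """Lowercase the string and split it into words, ignoring punctuation."""
    return re.findall(r'\w+', string.lower())


def count_ngrams_per_line(lines, min_length=4, max_length=5):
    """Count, for each n-gram, the number of lines it occurs in."""
    ngrams = collections.defaultdict(collections.Counter)
    for line in lines:
        words = tokenize(line)
        for length in range(min_length, max_length + 1):
            # A set keeps each n-gram at most once per line
            seen = {tuple(words[i:i + length])
                    for i in range(len(words) - length + 1)}
            # Counter.update adds 1 for every n-gram in the set
            ngrams[length].update(seen)
    return ngrams


rows = [
    "this is my puppy what this is my puppy",
    "this is my puppy",
]
counts = count_ngrams_per_line(rows)
print(counts[4][('this', 'is', 'my', 'puppy')])  # 2
```

The result has the same shape as before (a dict mapping length to a Counter), so it can be passed to print_most_frequent unchanged.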