This code generates n-grams and displays their counts. I have a CSV file with rows, and a column containing a string of words for each row. The code extracts 4-grams such as 'this is my puppy', and it also counts how many times each one occurs within the same row. What I want is for an n-gram that appears repeatedly in one row to be counted only once for that row; it should be counted a second time only if it appears in another row, and so on.
e.g.
row  Word
1    this is my puppy what this is my puppy
2    this is my puppy
So this code counts 'this is my puppy' three times, but I want it to be counted twice.
Here is the Python code:
import collections
import re
import sys
import time


def tokenize(string):
    """Convert string to lowercase and split into words (ignoring
    punctuation), returning list of words.
    """
    return re.findall(r'\w+', string.lower())


def count_ngrams(lines, min_length=4, max_length=5):
    """Iterate through given lines iterator (file object or list of
    lines) and return n-gram frequencies. The return value is a dict
    mapping the length of the n-gram to a collections.Counter
    object of n-gram tuple and number of times that n-gram occurred.
    Returned dict includes n-grams of length min_length to max_length.
    """
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams


def print_most_frequent(ngrams, num=10):
    """Print num most common n-grams of each length in n-grams dict."""
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open("PWorm.csv") as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))
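To reproduce the reported behavior, feeding the two sample rows into the question's functions yields a count of 3 for the 4-gram, because the deque keeps sliding across the row boundary. A minimal self-contained repro using the code above:

```python
import collections
import re


def tokenize(string):
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r'\w+', string.lower())


def count_ngrams(lines, min_length=4, max_length=5):
    """The question's counter: one deque shared across all lines."""
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()
    while len(queue) > min_length:
        queue.popleft()
        add_queue()
    return ngrams


rows = ["this is my puppy what this is my puppy",
        "this is my puppy"]
counts = count_ngrams(rows)
print(counts[4][('this', 'is', 'my', 'puppy')])  # prints 3; 2 is desired
```

The 4-gram is counted at positions 0 and 5 of row 1, and a third time in row 2, giving 3 instead of the desired 2.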
Any help would be greatly appreciated. Thank you.
Answer 0 (score: 0)
You can use a defaultdict for ngrams. To prevent the same n-gram from being counted twice within one row, you have to build a separate n-gram dict per line and then merge it into the overall n-gram dict:
def count_ngrams(lines, min_length=4, max_length=5):
    """Iterate through given lines iterator (file object or list of
    lines) and return n-gram frequencies. The return value is a dict
    mapping the length of the n-gram to a collections.Counter
    object of n-gram tuple and number of times that n-gram occurred.
    Returned dict includes n-grams of length min_length to max_length.
    """
    lengths = range(min_length, max_length + 1)
    ngrams = collections.defaultdict(collections.Counter)
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue(ngrams_line):
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams_line[length][current[:length]] = 1  # instead of += 1

    # Combine the per-line defaultdict(Counter) into the overall one
    def combine_ngrams(ngrams, ngrams_line):
        for k, v in ngrams_line.items():
            ngrams[k] += v
        return ngrams

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        ngrams_line = collections.defaultdict(collections.Counter)
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue(ngrams_line)
        ngrams = combine_ngrams(ngrams, ngrams_line)

    # Make sure we get the n-grams at the tail end of the queue
    ngrams_line = collections.defaultdict(collections.Counter)
    while len(queue) > min_length:
        queue.popleft()
        add_queue(ngrams_line)
    ngrams = combine_ngrams(ngrams, ngrams_line)

    return ngrams
I don't 100% understand the part after `while len(queue) > min_length:`, or why `queue` is not reset for each line, so you may need to adjust my answer slightly.
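On that last point: one way to address both concerns is to give each row its own queue and collect the row's n-grams into a set, so n-grams never span two rows and duplicates within a row are counted once. This is a sketch of that variant (not the answer's code as posted), reusing the question's tokenize helper; the tail-flush condition is loosened to `>=` so a row with exactly min_length words is not lost:

```python
import collections
import re


def tokenize(string):
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r'\w+', string.lower())


def count_ngrams(lines, min_length=4, max_length=5):
    """Count each distinct n-gram at most once per row."""
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    for line in lines:
        # Fresh queue per row, so n-grams never cross a row boundary.
        queue = collections.deque(maxlen=max_length)
        line_grams = set()  # distinct n-grams seen in this row

        def add_queue():
            current = tuple(queue)
            for length in lengths:
                if len(current) >= length:
                    line_grams.add(current[:length])

        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()
        # Flush the tail; >= so rows with exactly min_length words count.
        while len(queue) >= min_length:
            add_queue()
            queue.popleft()
        # Each distinct n-gram contributes once per row.
        for gram in line_grams:
            ngrams[len(gram)][gram] += 1
    return ngrams


rows = ["this is my puppy what this is my puppy",
        "this is my puppy"]
counts = count_ngrams(rows)
print(counts[4][('this', 'is', 'my', 'puppy')])  # prints 2, as desired
```

With this variant the duplicate occurrence within row 1 is deduplicated by the set, and row 2 contributes the second count.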