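I'm trying to create a CSV file of a ranked word list from a 500 MB text file containing Finnish text. The script does what I want on small files, but it doesn't work on the 500 MB beast.
I'm a complete beginner with Python, so forgive me if it's sloppy. From looking around, I think I may have to process the file line by line.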
with open(...) as f:
    for line in f:
        # Do something with 'line'
I'd appreciate any help, cheers! My code is below:
#load text
filename = 'finnish_text.txt'
file = open(filename, 'r')
text = file.read()
file.close()
#lowercase and split words by white space
lowercase = text.lower()
words = lowercase.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
# ranked word count specify return amount here
from collections import Counter
counter = Counter(stripped)  # lowercase name so the Counter class is not shadowed
most_occur = counter.most_common(100)
# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)
Edit: I ended up using the second solution from @Bharel (he's a legend), given in his answer. Because of encoding problems, I had to change a couple of lines:
with open(filename, 'r', encoding='Latin-1', errors='replace') as file:
with open('word_rank.csv', 'w', newline='', errors='replace') as csvfile:
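As a rough illustration of what the `encoding` and `errors` arguments do when reading, here is a minimal sketch. The sample text and the temporary file are made up for the demo and are not part of the original script.

```python
import os
import tempfile

# Made-up Finnish sample stored as Latin-1 bytes (ä is 0xE4, ö is 0xF6).
raw = "Hyvää yötä".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Reading with the matching encoding recovers the original text.
    with open(path, "r", encoding="latin-1", errors="replace") as f:
        as_latin1 = f.read()

    # The same bytes are not valid UTF-8; errors='replace' substitutes
    # U+FFFD for each undecodable byte instead of raising UnicodeDecodeError.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        as_utf8 = f.read()

print(as_latin1)
print(as_utf8)
```

Latin-1 assigns a character to every possible byte value, so decoding with it never fails. If the file is really in another encoding, the read will still succeed but the accented letters will come out wrong.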
Answer 0 (score: 1)
Switch everything to generators and it should work:
import csv
import string
from collections import Counter

# load text
filename = 'finnish_text.txt'

# translation table that removes punctuation from each word
table = str.maketrans('', '', string.punctuation)

# Auto-close when done
with open(filename, 'r') as file:
    # lowercase and split words by white space, one line at a time
    word_iterables = (line.lower().split() for line in file)
    # remove punctuation from each word
    stripped = (w.translate(table) for it in word_iterables for w in it)
    # ranked word count; the generators are consumed here,
    # so this must run while the file is still open
    counter = Counter(stripped)

# specify return amount here
most_occur = counter.most_common(100)

# export csv file
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)
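By using generators (parentheses instead of square brackets), the words are processed lazily, one line at a time, instead of all being loaded into memory at once.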
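To make the difference concrete, here is a small sketch comparing a list comprehension with the equivalent generator expression. The word list is invented for the demo, and `sys.getsizeof` only measures the container itself, not the strings it refers to.

```python
import sys

# Made-up data: 300,000 short words.
words = ["yksi", "kaksi", "kolme"] * 100_000

# Square brackets: every result is computed and stored immediately.
as_list = [w.upper() for w in words]

# Parentheses: nothing is computed until the generator is iterated.
as_gen = (w.upper() for w in words)

list_size = sys.getsizeof(as_list)  # grows with the number of items
gen_size = sys.getsizeof(as_gen)    # small and constant

# Pulling one item computes just that item.
first = next(as_gen)

print(list_size, gen_size, first)
```

A generator can only be iterated once, which is fine here because `Counter` reads the words in a single pass.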
If you want the most efficient way, here is a version I wrote as a challenge to myself:
import csv
import itertools
import operator
import string
from collections import Counter

# load text
filename = 'finnish_text.txt'

table = str.maketrans('', '', string.punctuation)

# Auto-close when done
with open(filename, 'r') as file:
    # Lowercase the lines
    lower_lines = map(str.lower, file)
    # Split the words in each line - will return [[word, word], [word, word]]
    word_iterables = map(str.split, lower_lines)
    # Combine the iterables:
    # i.e. [[word, word], [word, word]] -> [word, word, word, word]
    words = itertools.chain.from_iterable(word_iterables)
    # remove punctuation from each word
    stripped = map(operator.methodcaller("translate", table), words)
    # ranked word count (consumes the iterators while the file is open)
    counter = Counter(stripped)

# specify return amount here
most_occur = counter.most_common(100)

# export csv file
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)
It makes full use of the lazy iterators implemented in C (map and itertools).
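One thing to watch for with this pipeline: a token made only of punctuation, such as a standalone `-`, becomes an empty string after `translate` and is still counted. The sketch below runs the same `map`/`itertools` chain on an in-memory sample and optionally filters those out. The `count_words` helper, its `drop_empty` parameter, and the sample text are hypothetical names and data for illustration, not part of the answer above.

```python
import io
import itertools
import operator
import string
from collections import Counter

table = str.maketrans('', '', string.punctuation)

def count_words(lines, drop_empty):
    """Count words lazily; optionally skip tokens left empty after stripping."""
    lower_lines = map(str.lower, lines)
    words = itertools.chain.from_iterable(map(str.split, lower_lines))
    stripped = map(operator.methodcaller("translate", table), words)
    if drop_empty:
        # filter(None, ...) drops falsy items, i.e. the '' left by '-'
        stripped = filter(None, stripped)
    return Counter(stripped)

# Made-up sample; io.StringIO yields lines just like an open text file.
sample = "Kissa istuu - koira istuu!\nKissa nukkuu - koira juoksee.\n"

with_empty = count_words(io.StringIO(sample), drop_empty=False)
without_empty = count_words(io.StringIO(sample), drop_empty=True)

print(with_empty.most_common(3))
print(without_empty.most_common(3))
```

Note also that `string.punctuation` covers only ASCII punctuation, so typographic quotes or dashes in the Finnish text would pass through unchanged.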