My goal for this script is to take a folder, gather every line from every file in it, and then output a single file containing each unique line in descending order of frequency.
It doesn't just find the unique lines; it counts how often each unique line appears across all of the files.
It needs to handle a large amount of text - at least around 2GB - so I need it to be memory-efficient. So far, I haven't achieved that.
import os, sys                    # needed for looking into a directory
from sys import argv              # allows passing of arguments from the command line, where I call the script
from collections import Counter   # allows the lists to be sorted by number of occurrences

# Pass argument containing the directory of files to be combined
dir_string = str(argv[1])
filenames = []

# Get the names of the files in the directory and add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file))  # add names of files to a list

# Declare name of the file to be written
out_file_name = dir_string + ".txt"

# Create output file
outfile = open(out_file_name, "w")

# Declare list to be filled with lines seen
lines_seen = []

# Parse all lines in all files
for fname in filenames:           # for all files in the list
    with open(fname) as infile:   # open a given file
        for line in infile:       # for all lines in the current file, read one by one
            # Here's the problem.
            lines_seen.append(str(line).strip('\n'))  # add the line to the list of lines seen,
                                                      # removing the trailing newline

# Organize the list by number of occurrences; this produces a list of the form
# [(item a, # of a occurrences), (item b, # of b occurrences), ...]
lines_seen = Counter(lines_seen).most_common()

# Write the result line by line to the output file
for item in lines_seen:
    outfile.write(str(item[0]) + "\n")
outfile.close()
When I get an error message, it points at the lines_seen.append(str(line).strip('\n')) line.
I first tried appending the line without converting it to a string and stripping it, but that left a visible '\n' in the string, which I couldn't accept. For smaller lists, converting to a string and stripping wasn't much of a memory tax. I haven't been able to find a more efficient way to get rid of the terminating character.
On my PC this causes a MemoryError, and on my Mac it gives me Killed: 9 - I haven't tried it on Linux yet.
Do I need to convert to binary, combine my sorted lists, and then convert back? How else could this be done?
Edit - For me, the best overall approach turned out to be using unix commands:
cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut -c6- wordlist_with_count.txt > wordlist_sorted.txt
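For comparison, here is a minimal Python sketch of roughly that same pipeline (my addition, not part of the original question). It assumes that only the set of unique lines, rather than every line read, has to fit in memory:

import glob, os, sys
from collections import Counter

# Hypothetical sketch: stream every line through a Counter so that only
# unique lines are held in memory, then write them out by descending count.
counts = Counter()
for path in glob.glob(os.path.join(sys.argv[1], '*.txt')):
    with open(path) as infile:
        for line in infile:
            counts[line.rstrip('\n')] += 1

with open('wordlist_with_count.txt', 'w') as outfile:
    for line, n in counts.most_common():
        outfile.write('{} {}\n'.format(n, line))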
Answer 0 (score: 0)
I would approach the problem like this:
import os, sys                    # needed for looking into a directory
from sys import argv              # allows passing of arguments from the command line, where I call the script
from collections import Counter   # allows the lists to be sorted by number of occurrences
from operator import itemgetter

# Pass argument containing the directory of files to be combined
dir_string = str(argv[1])

# Get the names of the files in the directory and add them to a list
filenames = []
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file))  # add names of files to a list

# Declare name of the file to be written
out_file_name = os.path.join(dir_string, 'out.txt')

# Write all the files into a single file instead of building a list
with open(out_file_name, "w") as outfile:
    for fname in filenames:           # for all files in the list
        with open(fname) as infile:   # open a given file
            for line in infile:       # for all lines in the current file, read one by one
                outfile.write(line)

# Create a Counter object from the combined file
with open(out_file_name, "r") as outfile:
    c = Counter(outfile)

print("sorted by line alphabetically")
print(sorted(c.items(), key=itemgetter(0)))

print("sorted by count")
print(sorted(c.items(), key=itemgetter(1)))

def index_in_file(unique_line):
    with open(out_file_name, "r") as outfile:
        for num, line in enumerate(outfile, 1):
            if unique_line[0] in line:
                return num

print("sorted by first appearance of the line in the combined file")
s = sorted(c.items(), key=index_in_file)
print(s)

# Once you decide what kind of sort you want, write the sorted elements to the output file.
with open(out_file_name, "w") as outfile:
    for ss in s:
        outfile.write(ss[0].rstrip() + ':' + str(ss[1]) + '\n')
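One caveat on the approach above: index_in_file reopens and re-scans the combined file once per unique line, which becomes quadratic on large inputs. As a hypothetical alternative (my sketch, not part of the original answer), you could record each line's first position in a single pass and sort against that:

# Hypothetical: remember the first line number at which each unique line
# appeared, so sorting by first appearance needs no re-scanning.
first_seen = {}
with open(out_file_name, "r") as outfile:
    for num, line in enumerate(outfile, 1):
        first_seen.setdefault(line, num)

s = sorted(c.items(), key=lambda item: first_seen.get(item[0], 0))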
Answer 1 (score: 0)
Here is the way to reduce memory consumption that I suggested in the comments on one of the other answers:
import collections

lines_seen = collections.Counter()

for filename in filenames:
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip('\n')
            if line:
                lines_seen.update([line])

with open(out_file_name, "w") as outfile:
    for line, count in lines_seen.most_common():
        outfile.write('{}, {}\n'.format(line, count))
Note that line.strip('\n') only removes the newline at the end of each line as it is read, so line.rstrip('\n') would be slightly more efficient. You may also want to use line.strip() to remove leading and trailing whitespace; getting rid of that possibly substantial stored whitespace will reduce memory usage even further.
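As a quick illustration of the difference between the two calls (my example, not from the answer):

# rstrip('\n') removes only newline characters from the right end;
# strip() with no argument removes all leading and trailing whitespace.
line = '  some text \n'
print(repr(line.rstrip('\n')))   # '  some text '
print(repr(line.strip()))        # 'some text'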
Answer 2 (score: -1)
Your problem is clearly a lack of memory.
You could eliminate the redundant lines in lines_seen as you go, which should help.
from collections import Counter

lines_seen = Counter()

# in the for loop:
lines_seen[str(line).strip('\n')] += 1

# at the end:
for item in lines_seen.most_common():
    outfile.write(str(item[0]) + "\n")
Edit
As mentioned in the comments, another solution would be:
from collections import Counter

lines_seen = Counter()

# get the file names
for fname in filenames:           # for all files in the list
    with open(fname) as infile:   # open a given file
        lines_seen.update(infile.read().split('\n'))

for item in lines_seen.most_common():
    print(item[0], file=outfile)
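One caveat worth adding (my observation, not part of the original answer): infile.read().split('\n') still loads each whole file into memory at once. A streaming variant that feeds the Counter line by line keeps peak memory proportional to the number of unique lines rather than the size of the largest file:

# Hypothetical streaming variant: update the Counter from a generator,
# so no file is ever read into memory in full.
for fname in filenames:
    with open(fname) as infile:
        lines_seen.update(line.rstrip('\n') for line in infile)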