Merging gigabytes of text files into one, sorted by number of occurrences

Date: 2017-03-22 00:52:12

Tags: python list text-files find-occurrences word-list

My goal for this script is to take a folder, gather every line from all of the files in it, and then output a single file containing each unique line, sorted in descending order of frequency.

It doesn't just find the unique lines; it counts how often each unique line occurs across all of the files.

It needs to handle a large amount of text with this script, at least about 2GB, so it has to run efficiently. So far, I have not achieved that goal.

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))

filenames=[]  

#Get name of files in directory, add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list

#Declare name of file to be written
out_file_name = dir_string+".txt"

#Create output file
outfile = open(out_file_name, "w")

#Declare list to be filled with lines seen
lines_seen = []

#Parse all lines in all files
for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        for line in infile: #for all lines in current file, read one by one
            #Here's the problem.
            lines_seen.append(str(line).strip('\n')) #add line to list of lines seen,
                                                     #removing the endline

#Organizes the list by number of occurrences, but produces a list that contains
# [(item a, # of a occurrences), (item b, # of b occurrences), ...]
lines_seen = Counter(lines_seen).most_common()

#Write line by line to the output file
for item in lines_seen:
    outfile.write(str(item[0]) + "\n")

outfile.close()

When I get the error message, it points at the lines_seen.append(str(line).strip('\n')) line.

I first tried appending the lines without converting to a string and stripping, but that left a visible '\n' in each string, which was not acceptable to me. For smaller lists, converting to a string and stripping was not much of a memory tax. I haven't been able to find a more efficient way of getting rid of the end-of-line character.

On my PC this causes a MemoryError; on my Mac it gives me Killed: 9. I haven't tried it on Linux yet.

Do I need to convert to binary, combine my sorted lists, and then convert back? How else can this be done?

EDIT - The best overall approach for me turned out to be these unix commands:

cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut  -c6- wordlist_with_count.txt > wordlist_sorted.txt
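For reference, the same streaming idea can be sketched in Python without ever holding all lines in memory as a list. This is an illustrative sketch, not the asker's exact script; the function name `count_lines` is mine:

```python
from collections import Counter
import glob
import os

def count_lines(directory):
    """Tally every line across all .txt files in `directory`,
    reading one line at a time so memory use stays bounded
    by the number of *unique* lines, not the total text size."""
    counts = Counter()
    for path in glob.glob(os.path.join(directory, "*.txt")):
        with open(path) as infile:
            for line in infile:
                counts[line.rstrip("\n")] += 1
    return counts

# Writing the result, most frequent lines first, would look like:
# with open("wordlist_sorted.txt", "w") as outfile:
#     for text, _count in count_lines("DirectoryWithFiles").most_common():
#         outfile.write(text + "\n")
```

Like the `sort | uniq -c | sort -n -r` pipeline, this keeps only one copy of each distinct line plus a count.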

3 Answers:

Answer 0 (score: 0)

I would solve this problem like this:

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))


#Get name of files in directory, add them to a list
filenames = []
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list


#Declare name of file to be written
out_file_name = os.path.join(dir_string, 'out.txt')


# write all the files to a single file instead of list
with open(out_file_name, "w") as outfile:
    for fname in filenames: #for all files in list
        with open(fname) as infile: #open a given file
            for line in infile: #for all lines in current file, read one by one
                outfile.write(line)

# create a counter object from outfile
with open(out_file_name, "r") as outfile:
    c = Counter(outfile)



from operator import itemgetter

print("sorted by line alphabetically")
print(sorted(c.items(), key=itemgetter(0)))

print("sorted by count")
print(sorted(c.items(), key=itemgetter(1)))


def index_in_file(unique_line):
    # Return the 1-based number of the first line in outfile
    # that contains this counted line.
    with open(out_file_name, "r") as outfile:
        for num, line in enumerate(outfile, 1):
            if unique_line[0] in line:
                return num

print("sorted by appearance of line in the outfile")
s = sorted(c.items(), key=index_in_file)
print(s)

# Once you decide what kind of sort you want, write the sorted elements into an output file.
with open(out_file_name, "w") as outfile:
    for ss in s:
        outfile.write(ss[0].rstrip()+':'+str(ss[1])+'\n')
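One subtlety in the approach above: passing a file object straight to Counter keeps the trailing '\n' in every key, which is why the final write uses ss[0].rstrip(). A small illustration (io.StringIO stands in for a real file here):

```python
import io
from collections import Counter

# Counting a file object directly: each key keeps its trailing newline.
raw = Counter(io.StringIO("apple\nbanana\napple\n"))

# Stripping while counting gives clean keys instead.
clean = Counter(line.rstrip("\n") for line in io.StringIO("apple\nbanana\napple\n"))
```

With `raw`, lookups like `raw["apple"]` silently miss because the stored key is `"apple\n"`.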

Answer 1 (score: 0)

Here is the way of reducing memory consumption that I suggested in a comment on one of the other answers:

import collections

lines_seen = collections.Counter()

for filename in filenames:
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip('\n')
            if line:
                lines_seen.update([line])

with open(out_file_name, "w") as outfile:
    for line, count in lines_seen.most_common():
        outfile.write('{}, {}\n'.format(line, count))

Note that line.strip('\n') only ever removes the newline at the end of each line as it is read, so line.rstrip('\n') would be more efficient. You may also want to use line.strip() to remove leading and trailing whitespace; getting rid of the possibly substantial stored whitespace will reduce memory usage further.
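To make the difference concrete, a quick sketch (the sample strings are just illustrative):

```python
# rstrip('\n') removes only trailing newlines; strip('\n') would also
# look at the left end, which a line read from a file never needs.
line = "some text\n"
trimmed = line.rstrip("\n")   # newline gone, text untouched

# strip() with no argument also drops surrounding whitespace,
# shrinking the keys that Counter has to store.
padded = "   some text   \n"
compact = padded.strip()
```

Both calls return a new string; the choice only affects how much work is done per line and how large the stored keys are.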

Answer 2 (score: -1)

Your problem is obviously a lack of memory.

You could eliminate the redundant lines in lines_seen as you go, which should help.

from collections import Counter
lines_seen = Counter()

# in the for loop:
lines_seen[line.strip('\n')] += 1

# at the end:
for item in lines_seen.most_common():
    outfile.write(str(item[0])+"\n")

Edit

As mentioned in the comments, another solution would be:

from collections import Counter
lines_seen = Counter()

# get the files names

for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        lines_seen.update(infile.read().split('\n'))

for item in lines_seen.most_common():
    print( item[0], file=outfile )
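Note that infile.read().split('\n') still pulls each whole file into memory at once, which is the very thing that fails at 2GB. A streaming variant of the same Counter.update call, sketched here with an illustrative function name, avoids that:

```python
from collections import Counter

def count_streaming(filenames):
    """Tally unique lines while holding only one line in memory at a time."""
    lines_seen = Counter()
    for fname in filenames:
        with open(fname) as infile:
            # The generator feeds lines to update() one at a time,
            # so memory grows only with the number of unique lines.
            lines_seen.update(line.rstrip("\n") for line in infile)
    return lines_seen
```

The per-file cost drops from the file's full size to a single line's length, at no change in the final counts.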