Merging gigabytes of text files into one, sorted by number of occurrences

Date: 2017-03-22 00:52:12

Tags: python list text-files find-occurrences word-list

My goal for this script is to take a folder, gather every line from all of the files in it, and then output a single file containing each unique line, sorted in descending order of frequency.

It doesn't just find the unique lines; it counts how often each unique line occurs across all of the files.

It needs to handle a large amount of text with this script, at least about 2GB, so it has to run efficiently. So far, I have not achieved that goal.

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))

filenames=[]  

#Get name of files in directory, add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list

#Declare name of file to be written
out_file_name = dir_string+".txt"

#Create output file
outfile = open(out_file_name, "w")

#Declare list to be filled with lines seen
lines_seen = []

#Parse all lines in all files
for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        for line in infile: #for all lines in current file, read one by one
            #Here's the problem.
            lines_seen.append(str(line).strip('\n')) #add line to list of lines seen,
                                                     #removing the endline

#Organizes the list by number of occurrences, but produces a list that contains
# [(item a, # of a occurrences), (item b, # of b occurrences), ...]
lines_seen = Counter(lines_seen).most_common()

#Write line by line to the output file
for item in lines_seen:
    outfile.write(str(item[0]) + "\n")

outfile.close()

When I get the error message, it points at the lines_seen.append(str(line).strip('\n')) line.

I first tried appending the lines without converting to a string and stripping, but that left a visible '\n' in each string, which was not acceptable to me. For smaller lists, converting to a string and stripping was not much of a memory tax. I haven't been able to find a more efficient way of getting rid of the end-of-line character.

On my PC this causes a MemoryError; on my Mac it gives me Killed: 9. I haven't tried it on Linux yet.

Do I need to convert to binary, combine my sorted lists, and then convert back? How else can this be done?

EDIT - The best overall approach for me turned out to be these unix commands:

cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut  -c6- wordlist_with_count.txt > wordlist_sorted.txt
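For reference, the same streaming idea can be sketched in Python without ever holding all lines in memory as a list. This is an illustrative sketch, not the asker's exact script; the function name `count_lines` is mine:

```python
from collections import Counter
import glob
import os

def count_lines(directory):
    """Tally every line across all .txt files in `directory`,
    reading one line at a time so memory use stays bounded
    by the number of *unique* lines, not the total text size."""
    counts = Counter()
    for path in glob.glob(os.path.join(directory, "*.txt")):
        with open(path) as infile:
            for line in infile:
                counts[line.rstrip("\n")] += 1
    return counts

# Writing the result, most frequent lines first, would look like:
# with open("wordlist_sorted.txt", "w") as outfile:
#     for text, _count in count_lines("DirectoryWithFiles").most_common():
#         outfile.write(text + "\n")
```

Like the `sort | uniq -c | sort -n -r` pipeline, this keeps only one copy of each distinct line plus a count.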

3 Answers:

Answer 0 (score: 0)

I would solve this problem like this:

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))


#Get name of files in directory, add them to a list
filenames = []
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list


#Declare name of file to be written
out_file_name = os.path.join(dir_string, 'out.txt')


# write all the files to a single file instead of list
with open(out_file_name, "w") as outfile:
    for fname in filenames: #for all files in list
        with open(fname) as infile: #open a given file
            for line in infile: #for all lines in current file, read one by one
                outfile.write(line)

# create a counter object from outfile
with open(out_file_name, "r") as outfile:
    c = Counter(outfile)



from operator import itemgetter

print("sorted by line alphabetically")
print(sorted(c.items(), key=itemgetter(0)))

print("sorted by count")
print(sorted(c.items(), key=itemgetter(1)))


def index_in_file(unique_line):
    # Return the 1-based number of the first line in outfile
    # that contains this counted line.
    with open(out_file_name, "r") as outfile:
        for num, line in enumerate(outfile, 1):
            if unique_line[0] in line:
                return num

print("sorted by appearance of line in the outfile")
s = sorted(c.items(), key=index_in_file)
print(s)

# Once you decide what kind of sort you want, write the sorted elements into an output file.
with open(out_file_name, "w") as outfile:
    for ss in s:
        outfile.write(ss[0].rstrip()+':'+str(ss[1])+'\n')
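One subtlety in the approach above: passing a file object straight to Counter keeps the trailing '\n' in every key, which is why the final write uses ss[0].rstrip(). A small illustration (io.StringIO stands in for a real file here):

```python
import io
from collections import Counter

# Counting a file object directly: each key keeps its trailing newline.
raw = Counter(io.StringIO("apple\nbanana\napple\n"))

# Stripping while counting gives clean keys instead.
clean = Counter(line.rstrip("\n") for line in io.StringIO("apple\nbanana\napple\n"))
```

With `raw`, lookups like `raw["apple"]` silently miss because the stored key is `"apple\n"`.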

Answer 1 (score: 0)

Here is the way of reducing memory consumption that I suggested in a comment on one of the other answers:

import collections

lines_seen = collections.Counter()

for filename in filenames:
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip('\n')
            if line:
                lines_seen.update([line])

with open(out_file_name, "w") as outfile:
    for line, count in lines_seen.most_common():
        outfile.write('{}, {}\n'.format(line, count))

Note that line.strip('\n') only ever removes the newline at the end of each line as it is read, so line.rstrip('\n') would be more efficient. You may also want to use line.strip() to remove leading and trailing whitespace; getting rid of the possibly substantial stored whitespace will reduce memory usage further.
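To make the difference concrete, a quick sketch (the sample strings are just illustrative):

```python
# rstrip('\n') removes only trailing newlines; strip('\n') would also
# look at the left end, which a line read from a file never needs.
line = "some text\n"
trimmed = line.rstrip("\n")   # newline gone, text untouched

# strip() with no argument also drops surrounding whitespace,
# shrinking the keys that Counter has to store.
padded = "   some text   \n"
compact = padded.strip()
```

Both calls return a new string; the choice only affects how much work is done per line and how large the stored keys are.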

Answer 2 (score: -1)

Your problem is obviously a lack of memory.

You could eliminate the redundant lines in lines_seen as you go, which should help.

from collections import Counter
lines_seen = Counter()

# in the for loop:
lines_seen[line.strip('\n')] += 1

# at the end:
for item in lines_seen.most_common():
    outfile.write(str(item[0])+"\n")

Edit

As mentioned in the comments, another solution would be:

from collections import Counter
lines_seen = Counter()

# get the files names

for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        lines_seen.update(infile.read().split('\n'))

for item in lines_seen.most_common():
    print( item[0], file=outfile )
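Note that infile.read().split('\n') still pulls each whole file into memory at once, which is the very thing that fails at 2GB. A streaming variant of the same Counter.update call, sketched here with an illustrative function name, avoids that:

```python
from collections import Counter

def count_streaming(filenames):
    """Tally unique lines while holding only one line in memory at a time."""
    lines_seen = Counter()
    for fname in filenames:
        with open(fname) as infile:
            # The generator feeds lines to update() one at a time,
            # so memory grows only with the number of unique lines.
            lines_seen.update(line.rstrip("\n") for line in infile)
    return lines_seen
```

The per-file cost drops from the file's full size to a single line's length, at no change in the final counts.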