Question

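So I have this very large text file, which is supposed to contain 64 million passwords. (https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm <- the smaller wordlist (human passwords only)) I can't open it with Notepad++ or any other editor, even though I have 32GB of RAM.

I tried to read it all at once, remove the duplicates, and then store the result in a file: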

import os

IN_FILE = "./realhuman_phill.txt"
base, ext = os.path.splitext(IN_FILE)
outfile = base + "_no_duplicate" + ext
print "reading " + IN_FILE
all_words = open(IN_FILE).read().splitlines()
print "{} element in file".format(len(all_words))
print "removing duplicates"
myset = set()
myset.update(all_words)
print "{} elements remaining after duplicate removal".format(len(myset))

print "writing data"
with open(outfile, 'w') as f:
    for line in myset:
        f.write("%s\n" % line)

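I end up with a ~200MB file (it was over 600MB before) containing only 19991889 lines (19.9 million). That many duplicates? Weird.

So I counted the lines with a script based on this one: Lazy Method for Reading Big File in Python? It should load only one line of the file into RAM at a time: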

abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)
with open(abs_filename) as infile:
    counter = 0
    for line in infile:
        counter = counter + 1 
print counter

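It returns 19991889 = 19 991 889, the same number, far from 64 million, and this time without any duplicate removal.

My guess is that Python or my OS is not letting me access the rest of the file. Any idea what is going on?

Thanks

PS: Windows 8.1 64-bit, Python 2.7 64-bit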

Answer 1

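The problem may be with line endings and text-mode handling. On Windows, Python 2 opens files in text mode by default, and in that mode the C runtime translates line endings and treats a Ctrl-Z byte (0x1A) as end of file, so a stray control character in the word list can make reading stop early. Try forcing the file read mode to binary.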

with open(abs_filename, 'rb') as infile:
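A minimal sketch of what the full binary-mode count might look like, together with a way to check whether the file really contains a Ctrl-Z byte (0x1A), which text mode on Windows treats as end of file. The helper names `count_lines_binary` and `find_first_byte` and the sample data are assumptions made up for illustration; they are not part of the original script. The code is written to run on both Python 2.7 and Python 3.

```python
import os
import tempfile


def count_lines_binary(path):
    """Count lines by iterating over the file opened in binary mode."""
    count = 0
    with open(path, 'rb') as infile:
        # In binary mode no bytes are translated or treated as end of file.
        for _ in infile:
            count += 1
    return count


def find_first_byte(path, byte=b'\x1a', chunk_size=1 << 20):
    """Return the offset of the first occurrence of `byte`, or -1 if absent."""
    offset = 0
    with open(path, 'rb') as infile:
        while True:
            # Read fixed-size chunks so memory use stays small.
            chunk = infile.read(chunk_size)
            if not chunk:
                return -1
            pos = chunk.find(byte)
            if pos != -1:
                return offset + pos
            offset += len(chunk)


if __name__ == '__main__':
    # Hypothetical sample data: three "passwords", one containing Ctrl-Z.
    fd, sample_path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(b'hunter2\r\npass\x1aword\r\nletmein\r\n')
    try:
        print(count_lines_binary(sample_path))
        print(find_first_byte(sample_path))
    finally:
        os.remove(sample_path)
```

If `find_first_byte` reports an offset in the real file that is close to the size of what text mode managed to read, that would support the Ctrl-Z explanation. The duplicate-removal script would need the same `'rb'` mode when reading (and `'wb'` when writing) to see the whole file.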
