How to make the lines of a large text file unique

Date: 2015-02-16 14:09:07

Tags: python

I have a text file that contains 34,686,770 lines. All lines are between 50 and 250 characters long. Some lines occur more than once, and I want to make all of the lines unique.

I cannot store all of these lines in a list to make them unique, so how can I do it? For example, given these lines:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.

I have to create a file containing only the unique lines:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.

How can I do that?

3 answers:

Answer 0: (score: 5)

Without storing all of the text in memory:

with open('text.txt') as text:
    with open('unique.txt', 'w') as output:
        # hashes of the lines written so far
        seen = set()
        for line in text:
            line_hash = hash(line)
            if line_hash not in seen:
                # first occurrence of this line: keep it
                output.write(line)
                seen.add(line_hash)

Instead of the text itself we store its hash, which is much smaller. Of course, there is a chance of hash collisions, in which case this code would skip a unique line that should be in the output.
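
If that collision risk matters, a variant not in the original answer is to store a cryptographic digest instead of the built-in hash; collisions then become practically impossible, at the cost of 32 bytes per entry. A minimal sketch using hashlib:

import hashlib

with open('text.txt') as text:
    with open('unique.txt', 'w') as output:
        seen = set()
        for line in text:
            # SHA-256 digest of the line (32 bytes)
            digest = hashlib.sha256(line.encode('utf-8')).digest()
            if digest not in seen:
                output.write(line)
                seen.add(digest)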

Answer 1: (score: 1)

Using shell tools:

$ cat in.txt 
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.
$ sort < in.txt | uniq
I thought the author should have used more dialogue. It reads like a history book.
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
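
sort -u does both steps in one command, and GNU sort can handle files larger than memory by spilling to temporary files:

$ sort -u in.txt > out.txt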

Answer 2: (score: 1)

If you cannot load the file into memory, why not split it cleverly into smaller files and work on those? You only need to guarantee that identical lines end up in the same file, and you want some collisions so that you do not end up with a huge number of files.

Here is a script that takes a prefix of each sentence (this can obviously be changed) and puts the sentence into the file corresponding to that prefix (note that a prefix containing characters that are not valid in file names, such as '/', would need extra handling).

This effectively works like a hash map, just not an in-memory one, because your RAM cannot hold the amount of data you are trying to process.

The result is many smaller files ("buckets", if you will), with all occurrences of a line grouped in the same file because they share a prefix. The files can then be deduplicated individually and merged into a result file.

Here is how it's done:

Initialize the program to read from the file input.txt, write the result to output.txt, and use a prefix of size 2 for the hashing/splitting:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

Create the folder that will hold the split files with similar and identical lines:

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)
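
One caveat the original answer does not mention: put_in_file below opens the bucket files in append mode, so rerunning the script without clearing the splits folder first would duplicate its contents. A small guard, assuming the same split_folder as above:

import os
import shutil

# start from a clean slate: remove buckets left over from a previous run
if os.path.exists(split_folder):
    shutil.rmtree(split_folder)
os.makedirs(split_folder)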

A line-distribution function that puts a line into a given file:

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

A hash function that promises some collisions (which is good here), so that identical lines end up in the same file:

def prefix_hash(line):
    return line[:prefix_size]

Now we distribute the lines into the smaller files (the hash "buckets"):

with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        # make sure every line ends with a newline; text mode normalizes
        # line endings to '\n' on every platform, so check for that rather
        # than os.linesep
        putter(line if line.endswith('\n') else line + '\n')

Generate the list of the created file names:

# materialize as a list so it can be iterated more than once
# (in Python 3, map() returns a one-shot iterator)
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

Deduplicate the lines within each of the smaller files (note that a set does not preserve the original line order):

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

Concatenate the smaller files into the result file:

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())

And the whole thing put together:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

def prefix_hash(line):
    return line[:prefix_size]

with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        # text mode normalizes line endings to '\n' on every platform
        putter(line if line.endswith('\n') else line + '\n')

# materialize as a list so it can be iterated more than once
# (in Python 3, map() returns a one-shot iterator)
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())

Note: to speed this up, you should keep the file handles open the whole time instead of opening and closing them for every line, and perhaps use a few threads with a queue to pass lines between them (to avoid waiting on I/O). I can add this later if anyone wants it.
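
A minimal sketch of the first suggestion (keeping the handles open), assuming the split_folder and distribution loop from the script above; a dict caches one open handle per bucket, which also assumes the number of buckets stays below the OS limit on open file descriptors:

import os

open_files = {}  # bucket file name -> open file handle

# hypothetical drop-in replacement for put_in_file from the script above
def put_in_file_cached(file_name, line):
    # reuse an already-open handle instead of reopening the file per line
    f = open_files.get(file_name)
    if f is None:
        f = open(os.path.join(split_folder, file_name), 'a')
        open_files[file_name] = f
    f.write(line)

# ... distribute the lines exactly as before, then close everything:
for f in open_files.values():
    f.close()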