Question

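I have two files:

file1 (200 million lines), format: email:hash1:hash2
file2 (90 million lines), format: hash:plaintext

What I want to do is replace the hashes (hash1 or hash2) in file1 with the plaintext from file2. I tried the solution from a question I asked here earlier, "two lists, faster comparison in python" (the actual code is pasted below), but unfortunately it is not fast for data sets this large. It works for smaller files (a small number of lines), but not for larger ones.

Do you have any suggestions for processing these two files faster?

Edit: the source code mentioned above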

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys, os

def banner():
    print('\n%s v 1.0\nby d2@tdhack.com\n' % sys.argv[0])

def getlength(fname):
    return sum(1 for line in open(fname))

def ifexist(fname):
    if not os.path.isfile(fname):
        banner()
        print('[-] %s must exist' % fname)
        sys.exit(1)

def replace(l, X, Y):
  for i,v in enumerate(l):
     if v == X:
        l.pop(i)
        l.insert(i, Y)

if len(sys.argv) < 3:
    banner()
    print('[-] please provide CRACKED and HASHES files')
    sys.exit(1)

CRACKED=sys.argv[1]
HASHES=sys.argv[2]

ifexist(CRACKED)
ifexist(HASHES)

banner()
print('[i] preparing lists from "%s" [%d lines] and "%s" [%d lines]' %(CRACKED, getlength(CRACKED), HASHES, getlength(HASHES)))
with open(CRACKED) as crackedfile:
    cracked = dict(map(str, line.split(':', 1)) for line in crackedfile if ':' in line)

hashdata = [line.rstrip('\n') for line in open(HASHES)]

print('[i] pairing items, this will take a while so please be patient')
for item in hashdata:
    if item in cracked:
        replace(hashdata, item, item+':'+cracked[item].strip('\n'))

print('[i] writing changes')
fout = open(HASHES+'_paired', 'w')
for item in hashdata:
    fout.write(item+'\n')
fout.close()

print('[+] done, now check "%s" [%d lines] file for results.' % (HASHES+'_paired', getlength(HASHES+'_paired')))

Answer 1

With this many keys, I strongly recommend using some kind of database from Python for this task. With an SQL database, you would have two tables like these:

emails_and_hashes

column_name | column_type
----------- | ------------
email       | VARCHAR(255)
hash1       | VARCHAR(255)
hash2       | VARCHAR(255)

with an index on hash1 and an index on hash2.

hash_to_plaintext

column_name | column_type
----------- | ------------
hash        | VARCHAR(255)
plaintext   | TEXT

with an index on hash.
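
As a rough sketch of the two tables and their indexes, the snippet below creates them with Python's built-in sqlite3 module. SQLite is used here only so the example runs without a server; that is an assumption of this sketch, not the MySQL setup this answer describes, and the index names are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used only so the sketch runs without a server.
con = sqlite3.connect(":memory:")

con.executescript("""
    CREATE TABLE emails_and_hashes (
        email VARCHAR(255),
        hash1 VARCHAR(255),
        hash2 VARCHAR(255)
    );
    CREATE TABLE hash_to_plaintext (
        hash      VARCHAR(255),
        plaintext TEXT
    );
    -- Index names are hypothetical; any name works.
    CREATE INDEX idx_emails_hash1 ON emails_and_hashes (hash1);
    CREATE INDEX idx_emails_hash2 ON emails_and_hashes (hash2);
    CREATE INDEX idx_plain_hash   ON hash_to_plaintext (hash);
""")

# List the tables and indexes that now exist.
created = sorted(name for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'index')"))
print(created)
```

The index on hash_to_plaintext.hash is the one that matters for lookups by hash; without it every lookup would scan the whole table.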

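Then use a Python DB connector to iterate over the data and update the records from Python. This should cope with hundreds of millions of records far better than a dict held in memory. You could use code similar to the following (you will probably need to adjust it; this is not a complete answer), with this table setup, a MySQL database, and mysql.connector: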

import mysql.connector

con = mysql.connector.connect(user='your_user', password='your_password', database='your_database', host='your_host')
cur = con.cursor(dictionary=True) # 'dictionary=True' is my preference

def lookup(hash_value):
    # return the plaintext for a hash, or the hash itself if it is not in the table
    # note the trailing comma: the query parameters must be a tuple
    cur.execute("SELECT plaintext FROM hash_to_plaintext WHERE hash = %s", (hash_value,))
    rows = cur.fetchall()
    return rows[0]['plaintext'] if rows else hash_value

# open your file with emails and hashes
with open('/path/to/file1', 'r') as f:
    for line in f:
        # strip the newline so it does not end up in hash2
        email, hash1, hash2 = line.rstrip('\n').split(':')
        plaintext1 = lookup(hash1)
        plaintext2 = lookup(hash2)

        # store the plaintext wherever one was found
        cur.execute("INSERT INTO emails_and_hashes VALUES (%s, %s, %s)", (email, plaintext1, plaintext2))

con.commit()
con.close()
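
Looking up every hash with its own SELECT means hundreds of millions of round trips to the database. An alternative, shown here as a sketch and not as the code from this answer, is to load both files into the tables and let the database do the pairing with one LEFT JOIN per hash column. The example uses sqlite3 and a few made-up rows so that it runs anywhere; the sample values and the COALESCE fallback to the original hash are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real database
con.executescript("""
    CREATE TABLE emails_and_hashes (email TEXT, hash1 TEXT, hash2 TEXT);
    CREATE TABLE hash_to_plaintext (hash TEXT, plaintext TEXT);
    CREATE INDEX idx_plain_hash ON hash_to_plaintext (hash);
""")

# Made-up sample rows; real data would be bulk-loaded from file1 and file2.
con.executemany("INSERT INTO emails_and_hashes VALUES (?, ?, ?)", [
    ("a@example.com", "aaa111", "bbb222"),
    ("b@example.com", "ccc333", "ddd444"),
])
con.executemany("INSERT INTO hash_to_plaintext VALUES (?, ?)", [
    ("aaa111", "secret"),
    ("ddd444", "hunter2"),
])

# One query pairs every row; COALESCE keeps the hash when no plaintext exists.
query = """
    SELECT e.email,
           COALESCE(p1.plaintext, e.hash1),
           COALESCE(p2.plaintext, e.hash2)
    FROM emails_and_hashes AS e
    LEFT JOIN hash_to_plaintext AS p1 ON p1.hash = e.hash1
    LEFT JOIN hash_to_plaintext AS p2 ON p2.hash = e.hash2
    ORDER BY e.email
"""
paired = [":".join(row) for row in con.execute(query)]
print(paired)
```

This assumes each hash appears at most once in hash_to_plaintext; duplicate hashes would produce duplicate output rows.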

Answer 2

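After thinking about it for a day, I came up with the idea of using a trie.

A trie lets you store the hash dictionary, with all its repeated prefixes, in a more efficient container while keeping lookups at the same cost.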
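
To make the idea concrete, here is a toy trie built from nested dicts. It only illustrates how keys that share a prefix share nodes; it is not how an optimized trie library works internally, and in pure Python it will not save memory. The names `insert`, `lookup`, and the `END` marker are made up for this sketch.

```python
END = object()  # marker key under which a node stores its value

def insert(root, key, value):
    """Walk or create one node per character, then store the value."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[END] = value

def lookup(root, key, default=None):
    """Follow the key character by character; cost depends on len(key) only."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return default
    return node.get(END, default)

root = {}
insert(root, "ab12", "password")
insert(root, "ab34", "letmein")

print(lookup(root, "ab12"))
print(lookup(root, "ab99"))
# Both keys share the "a" -> "b" path and only branch afterwards.
print(len(root), len(root["a"]["b"]))
```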

There is a good trie implementation on PyPI called marisa-trie.

Here is an idea of how to implement it:

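import marisa_trie

def read_pairs(path):
    # yield (hash, plaintext) pairs: BytesTrie needs str keys and bytes values
    with open(path, "rb") as myfile:
        for line in myfile:
            key, sep, value = line.rstrip(b"\n").partition(b":")
            if sep:
                yield key.decode("latin-1"), value

trie = marisa_trie.BytesTrie(read_pairs("file2"))
def plaintext(hash_value):
    # trie.get returns a list of values; keep the hash when it is not in file2
    values = trie.get(hash_value.decode("latin-1"))
    return values[0] if values else hash_value

with open("file1", "rb") as input_file, open("modified_file1", "wb") as output_file:
    for line in input_file:
        email, hash1, hash2 = line.rstrip(b"\n").split(b":")
        output_file.write(b":".join([email, plaintext(hash1), plaintext(hash2)]) + b"\n")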

This should be fast, and 50-100 times more memory efficient than a dict.

You can also save the built trie so that you do not have to rebuild it every time:

trie.save('my_hashes.trie')

Then load it:

trie = marisa_trie.BytesTrie()
trie.load('my_hashes.trie')
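
For comparison, the same single-pass loop can be written with a plain dict when file2 fits in memory. This is a sketch and not part of the original answer: the function name `pair_files` and the choice to leave uncracked hashes unchanged are assumptions.

```python
def pair_files(cracked_path, hashes_path, output_path):
    """Replace each hash in hashes_path with its plaintext from cracked_path."""
    # Build hash -> plaintext once; split on the first ':' only,
    # because the plaintext itself may contain ':'.
    cracked = {}
    with open(cracked_path, "rb") as cracked_file:
        for line in cracked_file:
            key, sep, value = line.rstrip(b"\n").partition(b":")
            if sep:
                cracked[key] = value

    # Stream the email file line by line and write each result immediately,
    # so only the dict is held in memory.
    with open(hashes_path, "rb") as src, open(output_path, "wb") as dst:
        for line in src:
            email, hash1, hash2 = line.rstrip(b"\n").split(b":")
            dst.write(b":".join([
                email,
                cracked.get(hash1, hash1),  # keep the hash if it is not cracked
                cracked.get(hash2, hash2),
            ]) + b"\n")
```

It would be called as pair_files("file2", "file1", "modified_file1"). Each lookup is a constant-time dict access, so the run time grows linearly with the number of lines; the cost is the memory needed to hold all of file2.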

Python: pairing items from two large lists
