Question

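I am writing a parallel rainbow table generator using Parallel Python and multiple machines. So far I have it working on a single machine. It creates all possible passwords, hashes them, and saves them to a file. It takes max_pass_len and file as arguments. The charset is predefined. Here is the code: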

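from crypt import crypt  # Unix-only; removed from the standard library in Python 3.13
from itertools import combinations_with_replacement

def hashAndSave(listOfComb, fileObject):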
    for item in listOfComb:
        hashedVal = crypt(item, 'po')
        fileObject.write("%s:%s\n" % (hashedVal, item))


def gen_rt_save(max_pw_len, file):
    charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

    for i in range(3, max_pw_len):
        lista = [''.join(j) for j in combinations_with_replacement(charset, i)]
        hashAndSave(lista, file)

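To parallelize it, I need to split the work between multiple machines. They need to know where to start and where to stop generating passwords.

I figure I need a function that takes two arguments - a start point and a stop point. The charset is global and has to be fully used in those combinations.

The simplest way would be to pick the subset defined by two specific combinations out of the list of all possible combinations for the given charset and length range. However, that takes time and space, and I need to avoid that.

Example: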

charset='abcdefghi' #minified charset, normally 62 characters
ranged_comb('abf', 'defg')
result -> # it is not a combination between two lists! there are specific functions for that, and they would not use the full charset, only what's in the lists
abf
abg
abh
abi
aca
acb
...
defd
defe
deff
defg

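I thought about using lists of indexes of the charset letters as parameters and using them in for loops. However, I can't really use for loops, because their number may vary. How can I create such a function?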

Answer 1

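Since you don't need strict lexicographic order for brute-forcing passwords or generating rainbow tables, as long as you go through all permutations (with repetition), this is quite simple: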

def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result

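Then all you need is the index of your permutation to get it from your iterable (strings work, too). What it does is essentially loop over every element position for the given size and pick the element selected by the passed index, storing it in the result list. In other words, it writes the index in base len(source), least significant digit first. You can use return "".join(result) instead if you're interested in getting a string out of it.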
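To make the index-to-permutation mapping concrete, here is a small sketch on a three-letter charset. The function is the one above, changed only to return a string; index_of_permutation is a hypothetical inverse helper added for illustration, which could be used to turn a start or stop word such as 'abf' into an index.

```python
from itertools import product

def get_permutation_by_index(source, size, index):
    # Same logic as above: peel off base-len(source) digits, least significant first.
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return "".join(result)

def index_of_permutation(source, permutation):
    # Hypothetical inverse helper (an assumption, not part of the answer's code):
    # rebuild the index, treating the first character as the least significant digit.
    index = 0
    for char in reversed(permutation):
        index = index * len(source) + source.index(char)
    return index

charset = "abc"
size = 2
words = [get_permutation_by_index(charset, size, i)
         for i in range(len(charset) ** size)]
print(words)  # ['aa', 'ba', 'ca', 'ab', 'bb', 'cb', 'ac', 'bc', 'cc']

# Every index maps to a distinct word, and together they cover the same
# set that itertools.product yields for this charset and size.
expected = sorted("".join(p) for p in product(charset, repeat=size))
print(sorted(words) == expected)  # True
print(index_of_permutation(charset, "cb"))  # 5
```

Note that the order is not lexicographic (the first character changes fastest), which is fine as long as every word is visited exactly once.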

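Now your workers can use this function to generate ranged chunks of 'passwords'. The simplest way to distribute the work would be for each worker to receive a single index from the distributor, perform its task and wait for the next index. However, unless your hashing function is excruciatingly slow, spawning workers, transferring data and so on may end up slower than executing everything linearly in the main process. That's why you ideally want your workers to crunch larger chunks at a time to justify distributing the whole process. So you want your workers to accept a range and do something like: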

def worker(source, size, start, end):
    result = []
    for i in range(start, end):
        result.append(get_permutation_by_index(source, size, i))  # add to the result
    return result
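If the ranges are handed out to a fixed number of machines, the index space for one size can be cut into contiguous (start, end) pairs that plug straight into worker(). A minimal sketch, assuming a hypothetical helper named split_range (not part of the code above):

```python
def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result

def worker(source, size, start, end):
    return [get_permutation_by_index(source, size, i) for i in range(start, end)]

def split_range(total, parts):
    # Hypothetical helper: cut [0, total) into `parts` contiguous (start, end)
    # ranges whose lengths differ by at most one.
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

charset = "abcd"
size = 3
total = len(charset) ** size  # 64 permutations of length 3
ranges = split_range(total, 3)
print(ranges)  # [(0, 22), (22, 43), (43, 64)]

# Each machine would run worker() on its own range; here they run in sequence.
combined = []
for start, end in ranges:
    combined.extend("".join(p) for p in worker(charset, size, start, end))
print(len(combined), len(set(combined)))  # 64 64
```

Because the ranges are half-open and adjacent, no permutation is generated twice and none is skipped.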

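Then all you need is a 'distributor' - a central dispatcher that directs the workers and splits the workload among them. Since our workers don't accept varying sizes (that's an exercise I'll leave to you, you have all the ingredients), your 'distributor' needs to advance through the sizes and keep track of which chunks it sends to your workers. This means that for the smaller sizes and for edge chunks your workers will receive less work than the defined chunk size, but in the grand scheme of things that won't matter for your use case. So, a simple distributor would look like: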

def distributor(source, start, end, chunk_size=1000):
    result = []
    for size in range(start, end + 1):  # for each size in the given range...
        total = len(source) ** size  # max number of permutations for this size
        for chunk in range(0, total, chunk_size):  # for each chunk...
            data = worker(source, size, chunk, min(chunk + chunk_size, total))  # process...
            result.append(data)  # store the result...
    return result

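Here start and end are the smallest and largest permutation lengths (both inclusive) that you want your workers to go through, and chunk_size is the number of permutations each worker should process in the ideal case. As I mentioned, that won't be the case when the total number of permutations for a given size is lower than chunk_size, or when fewer than chunk_size unprocessed permutations remain for a given size, but those are edge cases and I'll leave it to you to figure out how to distribute them more evenly. Also keep in mind that the returned result is a list of the lists returned by our workers - you'll have to flatten it if you want to treat all the results equally.

But wait, isn't this just linear execution in a single process? Well, of course it is! What we did here is effectively decouple the workers from the distributor, so now we can add as many layers of separation and/or parallelization in between as we like without affecting our execution. For example, here's how to make our workers run in parallel: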
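To see the edge chunks and the flattening step in isolation, here is a small sketch. chunk_bounds is a hypothetical helper that only mirrors the two loops of the distributor above and yields the ranges it would hand out; the flattening uses itertools.chain.from_iterable from the standard library.

```python
from itertools import chain

def chunk_bounds(source_len, start, end, chunk_size=1000):
    # Hypothetical helper mirroring the distributor's loops: yield
    # (size, chunk_start, chunk_end) for every chunk it would hand out.
    for size in range(start, end + 1):
        total = source_len ** size
        for chunk in range(0, total, chunk_size):
            yield size, chunk, min(chunk + chunk_size, total)

bounds = list(chunk_bounds(4, 2, 3, chunk_size=10))
print(bounds[:2])  # [(2, 0, 10), (2, 10, 16)]
print(bounds[-1])  # (3, 60, 64)

# Flattening one level: a list of per-chunk lists becomes one flat list.
per_chunk = [["aa", "ba"], ["ca", "da"], ["ab"]]
flat = list(chain.from_iterable(per_chunk))
print(flat)  # ['aa', 'ba', 'ca', 'da', 'ab']
```

With a 4-letter source and a chunk size of 10, the last chunk of each size is shorter than the rest (6 and 4 items here), which is the edge case described above.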

from multiprocessing import Pool
import time

def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result

# let's have our worker perform a naive ascii-shift Caesar cipher
def worker(source, size, start, end):
    result = []
    for i in range(start, end):
        time.sleep(0.2)  # simulate a long operation by adding 200 milliseconds of pause
        permutation = get_permutation_by_index(source, size, i)
        # naive Caesar cipher - simple ascii shift by +4 places
        result.append("".join([chr(ord(x) + 4) for x in permutation]))
    return result

def distributor(source, start, end, workers=10, chunk_size=10):
    pool = Pool(processes=workers)  # initiate our Pool with a specified number of workers
    jobs = set()  # store our worker result references
    for size in range(start, end + 1):  # for each size in the given range...
        total = len(source) ** size  # max number of permutations for this size
        for chunk in range(0, total, chunk_size):  # for each chunk...
            # add a call to the worker to our Pool
            r = pool.apply_async(worker,
                                 (source, size, chunk, min(chunk + chunk_size, total)))
            jobs.add(r)  # add our ApplyResult in the jobs set for a later checkup
    result = []
    while jobs:  # loop as long as we're waiting for results...
        for job in jobs:
            if job.ready():  # current worker finished processing...
                result.append(job.get())  # store our result...
                jobs.remove(job)
                break
        time.sleep(0.05)  # let other threads get a chance to breathe a little...
    return result  # keep in mind that this is NOT an ordered result

if __name__ == "__main__":  # important protection for cross-platform use

    # use 6 worker processes to go through all 2- and 3-letter permutations
    # of "abcd", using the default chunk size ('ciphers per worker') of 10
    caesar_permutations = distributor("abcd", 2, 3, 6)

    print([perm for x in caesar_permutations for perm in x])  # print flattened results

# ['gg', 'hg', 'eh', 'fh', 'gh', 'hh', 'eff', 'fff', 'gff', 'hff', 'egf', 'fgf', 'ggf',
#  'hgf', 'ehf', 'fhf', 'ghf', 'hhf', 'eeg', 'feg', 'geg', 'heg', 'efg', 'ffg', 'gfg',
#  'hfg', 'eee', 'fee', 'gee', 'hee', 'efe', 'ffe', 'gfe', 'hfe', 'ege', 'fge', 'ee',
#  'fe', 'ge', 'he', 'ef', 'ff', 'gf', 'hf', 'eg', 'fg', 'gge', 'hge', 'ehe', 'fhe',
#  'ghe', 'hhe', 'eef', 'fef', 'gef', 'hef', 'ehh', 'fhh', 'ghh', 'hhh', 'egg', 'fgg',
#  'ggg', 'hgg', 'ehg', 'fhg', 'ghg', 'hhg', 'eeh', 'feh', 'geh', 'heh', 'efh', 'ffh',
#  'gfh', 'hfh', 'egh', 'fgh', 'ggh', 'hgh']

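Voilà! Everything executes in parallel (and over multiple cores, if the underlying OS does its scheduling properly). This should be sufficient for your use case - all you need is to add your communication or I/O code to the worker function and let the actual code be executed by a receiver on the other side, then return the results to the distributor when you get them. You can also write your tables directly in distributor() instead of waiting for everything to finish.

If you're going to execute this exclusively over a network, you don't really need a multiprocess setup; threads are enough to handle the I/O latency, so just replace the multiprocessing import with: { {1}} (don't let the module name fool you, this is a threading interface, not a multiprocessing interface!).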
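The import line itself is missing from the text above. One import that fits the description is multiprocessing.pool.ThreadPool, a thread-backed pool with the same interface as Pool (multiprocessing.dummy.Pool is another); treat that choice as an assumption. The sketch below uses it together with imap_unordered to write each chunk as soon as it is ready instead of collecting everything first. distribute_to_file and the tuple-taking worker are hypothetical variations on the code above, and an in-memory buffer stands in for the real output file.

```python
import io
from multiprocessing.pool import ThreadPool  # thread-backed pool with the Pool API

def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result

def worker(args):
    # Takes one tuple because imap_unordered passes a single argument per task.
    source, size, start, end = args
    return ["".join(get_permutation_by_index(source, size, i))
            for i in range(start, end)]

def distribute_to_file(source, start, end, out, workers=4, chunk_size=10):
    # Hypothetical distributor variant: build the chunk list, then write each
    # finished chunk to `out` as it arrives (in no particular order).
    tasks = [(source, size, chunk, min(chunk + chunk_size, len(source) ** size))
             for size in range(start, end + 1)
             for chunk in range(0, len(source) ** size, chunk_size)]
    written = 0
    with ThreadPool(processes=workers) as pool:
        for block in pool.imap_unordered(worker, tasks):
            for word in block:
                out.write(word + "\n")
                written += 1
    return written

buffer = io.StringIO()  # stand-in for a real file opened for writing
count = distribute_to_file("abcd", 2, 3, buffer)
print(count)  # 80
```

Only the main thread writes to the output, so no locking is needed around the file object.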
