I am trying to implement LSH in PySpark. My implementation works perfectly on small sets of documents, but when the document set is large I get this error:
AttributeError: Can't pickle local object '__hash_family__.<locals>.hash_member'
followed by:
19/11/21 17:59:40 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/Users/<my_home_dir>/PycharmProjects/data_mining/hw_2/ex_3/main_kijiji.py", line 62, in <module>
    lsh = signatures.reduce(lambda x, y: __update_hash_table__(x[0], x[1], lsh_b, lsh_r) +
  File "/Library/Python/3.7/site-packages/pyspark/rdd.py", line 844, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Library/Python/3.7/site-packages/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Library/Python/3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Library/Python/3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Library/Python/3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
The error is raised by this line of code:
lsh = signatures.reduce(lambda x, y: __update_hash_table__(x[0], x[1], hash_tables, lsh_b, lsh_r) +
                                     __update_hash_table__(y[0], y[1], hash_tables, lsh_b, lsh_r)).cache()
where hash_tables is a list generated this way:
hash_tables = [[__hash_family__(i, lsh_num_hashes), {}] for i in range(lsh_b)]
The __hash_family__ function is the following:
import hashlib

def __hash_family__(i, resultSize=20):
    maxLen = 10  # how long can our i be (in decimal)
    salt = str(i).zfill(maxLen)[-maxLen:]

    def hash_member(x):
        return hashlib.sha1((x + salt).encode()).digest()[-resultSize:]

    return hash_member
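As far as I can tell, hash_member is exactly the "local object" named in the error: the standard pickle module cannot serialize a function that is defined inside another function. The failure reproduces without Spark (a minimal sketch with made-up names, just to illustrate):

import pickle

def make_hasher(salt):
    # Same shape as __hash_family__: the returned function is a local object.
    def hash_member(x):
        return x + salt
    return hash_member

try:
    pickle.dumps(make_hasher("00"))
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_hasher.<locals>.hash_member'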
And this is the __update_hash_table__ function:
def __update_hash_table__(doc_id, sig, hash_tables, lsh_b, lsh_r):
    for b in range(lsh_b):
        start_row = b * lsh_r
        end_row = start_row + lsh_r
        band = str(sig[start_row:end_row])
        bucket_idx = hash_tables[b][0](''.join(band))
        try:
            # append doc_id to an existing bucket, or create the bucket
            hash_tables[b][1][bucket_idx].append(doc_id)
        except KeyError:
            hash_tables[b][1][bucket_idx] = [doc_id]
    return hash_tables
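To make the banding concrete, this is how I exercise the function locally (toy sizes and a made-up signature, purely for illustration):

# 2 bands of 3 rows each over a 6-value signature
lsh_b, lsh_r = 2, 3
tables = [[__hash_family__(i), {}] for i in range(lsh_b)]
tables = __update_hash_table__("doc-1", [4, 8, 15, 16, 23, 42], tables, lsh_b, lsh_r)
for _, buckets in tables:
    print(buckets)  # one bucket per band, each mapping a digest to ['doc-1']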
I even tried generating hash_tables directly in the file that contains the definition of __update_hash_table__, and also building the tables inside the function itself, but I always end up with a pickling error. How can I rewrite the code so that the result of the reduce operation is stored in the variable lsh? I know I could collect the signatures from the RDD and convert them to a list, but that would be very expensive. Can I do this without increasing the execution time?
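For completeness, the expensive fallback I mentioned would look roughly like this (a sketch that pulls every signature to the driver and builds the tables there):

# Collect everything to the driver and fold locally -- works, but
# serializes all signatures through the driver.
lsh = hash_tables
for doc_id, sig in signatures.collect():
    lsh = __update_hash_table__(doc_id, sig, lsh, lsh_b, lsh_r)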