I am trying to implement LSH in PySpark. My implementation works perfectly on small sets of documents, but when the document set is large I get this error:
AttributeError: Can't pickle local object '__hash_family__.<locals>.hash_member'
followed by:
19/11/21 17:59:40 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/Users/<my_home_dir>/PycharmProjects/data_mining/hw_2/ex_3/main_kijiji.py", line 62, in <module>
    lsh = signatures.reduce(lambda x, y: __update_hash_table__(x[0], x[1], lsh_b, lsh_r) +
  File "/Library/Python/3.7/site-packages/pyspark/rdd.py", line 844, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Library/Python/3.7/site-packages/pyspark/rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Library/Python/3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Library/Python/3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Library/Python/3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
The error is raised by this line of code:
lsh = signatures.reduce(lambda x, y: __update_hash_table__(x[0], x[1], hash_tables, lsh_b, lsh_r) +
                                     __update_hash_table__(y[0], y[1], hash_tables, lsh_b, lsh_r)).cache()
where hash_tables is a list generated this way:
hash_tables = [[__hash_family__(i, lsh_num_hashes), {}] for i in range(lsh_b)]
The __hash_family__ function is the following:
import hashlib

def __hash_family__(i, resultSize=20):
    maxLen = 10  # how long can our i be (in decimal)
    salt = str(i).zfill(maxLen)[-maxLen:]

    def hash_member(x):
        return hashlib.sha1((x + salt).encode()).digest()[-resultSize:]

    return hash_member
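As far as I can tell, hash_member is exactly the "local object" named in the error: the standard pickle module cannot serialize a function that is defined inside another function. The failure reproduces without Spark (a minimal sketch with made-up names, just to illustrate):

import pickle

def make_hasher(salt):
    # Same shape as __hash_family__: the returned function is a local object.
    def hash_member(x):
        return x + salt
    return hash_member

try:
    pickle.dumps(make_hasher("00"))
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_hasher.<locals>.hash_member'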
And this is the __update_hash_table__ function:
def __update_hash_table__(doc_id, sig, hash_tables, lsh_b, lsh_r):
    for b in range(lsh_b):
        start_row = b * lsh_r
        end_row = start_row + lsh_r
        band = str(sig[start_row:end_row])
        bucket_idx = hash_tables[b][0](''.join(band))
        try:
            # append doc_id to an existing bucket, or create the bucket
            hash_tables[b][1][bucket_idx].append(doc_id)
        except KeyError:
            hash_tables[b][1][bucket_idx] = [doc_id]
    return hash_tables
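To make the banding concrete, this is how I exercise the function locally (toy sizes and a made-up signature, purely for illustration):

# 2 bands of 3 rows each over a 6-value signature
lsh_b, lsh_r = 2, 3
tables = [[__hash_family__(i), {}] for i in range(lsh_b)]
tables = __update_hash_table__("doc-1", [4, 8, 15, 16, 23, 42], tables, lsh_b, lsh_r)
for _, buckets in tables:
    print(buckets)  # one bucket per band, each mapping a digest to ['doc-1']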
I even tried generating hash_tables directly in the file that contains the definition of __update_hash_table__, and also building the tables inside the function itself, but I always end up with a pickling error. How can I rewrite the code so that the result of the reduce operation is stored in the variable lsh? I know I could collect the signatures from the RDD and convert them to a list, but that would be very expensive. Can I do this without increasing the execution time?
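For completeness, the expensive fallback I mentioned would look roughly like this (a sketch that pulls every signature to the driver and builds the tables there):

# Collect everything to the driver and fold locally -- works, but
# serializes all signatures through the driver.
lsh = hash_tables
for doc_id, sig in signatures.collect():
    lsh = __update_hash_table__(doc_id, sig, lsh, lsh_b, lsh_r)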