Question

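I am trying to deploy a trained Faiss index to PySpark and run a distributed search. The whole process consists of:

1. Preprocessing
2. Loading the Faiss index (~15 GB) and running the Faiss search
3. Postprocessing and writing to HDFS

I set the number of CPUs per task to 10 (spark.task.cpus=10) so that the search can run multi-threaded. But steps 1 and 3 can only use 1 CPU per task. To make use of all CPUs, I want to set spark.task.cpus=1 before steps 1 and 3. I tried the set method of RuntimeConfig, but it seems to make my program hang. Any advice on how to change the configuration at runtime, or on how else to optimize this?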

Code example:

import faiss
import numpy as np


def load_and_search(x, model_path):
    # load the trained index once per partition
    faiss_idx = faiss.read_index(model_path)
    # x is the partition iterator; materialize it before concatenating
    q_vec = np.concatenate(list(x))
    _, idx_array = faiss_idx.search(q_vec, k=10)
    return idx_array


data = sc.textFile(input_path)

# preprocess, only uses one CPU per task
data = data.map(lambda x: x)

# load faiss index and search, uses multiple CPUs per task
data = data.mapPartitions(lambda x: load_and_search(x, model_path))

# postprocess and write, one CPU per task
data.map(lambda x: x).saveAsTextFile(result_path)

Answer 1

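An alternative: use mapPartitions for steps 1 and 3. Then, inside each worker, use a multiprocessing pool to map the items of the partition in parallel. That way you can use all the CPUs assigned to the worker without changing the configuration (which I am not sure is possible at all).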

Pseudocode:

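import multiprocessing as mp

def item_mapper(item):
    # per-item work for step 1 or step 3 goes here
    return ...

def partition_mapper(partition):
    # one pool per partition, sized to match spark.task.cpus
    with mp.Pool(processes=10) as pool:
        yield from pool.imap(item_mapper, partition)

rdd.mapPartitions(partition_mapper)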
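
To make the pseudocode concrete, here is a minimal sketch that runs without Spark, so the partition mapper can be tried locally on a plain iterator. The squaring in item_mapper, the processes and chunksize parameters, and the explicit "fork" start method are all my assumptions for illustration (fork assumes Linux workers), not something from the question.

```python
import multiprocessing as mp


def item_mapper(item):
    # Hypothetical per-item work; stands in for the real pre/postprocessing.
    return item * item


def partition_mapper(partition, processes=4, chunksize=64):
    # Assumption: Linux workers, so the "fork" start method is available and
    # the child processes inherit item_mapper from the parent.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=processes) as pool:
        # imap yields results in input order; chunksize batches items
        # to cut down on inter-process overhead.
        yield from pool.imap(item_mapper, partition, chunksize=chunksize)


if __name__ == "__main__":
    # Local check on a plain iterator, no Spark needed.
    print(list(partition_mapper(iter(range(5)), processes=2)))
```

In the Spark job this would be passed as rdd.mapPartitions(partition_mapper), with processes chosen to match spark.task.cpus. Whether the pool overhead pays off depends on how heavy the per-item work is.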

Answer 2

You can change the SparkContext properties like this:

conf = sc._conf.setAll([('spark.task.cpus','1')])
sc._conf.getAll()
data = data.map(lambda x: x)

conf = sc._conf.setAll([('spark.task.cpus','10')])
sc._conf.getAll()
# load faiss index and search, used multiple cpus per task
data = data.mapPartitions(lambda x: load_and_search(x, model_path))

conf = sc._conf.setAll([('spark.task.cpus','1')])
sc._conf.getAll()
# postprocess and write, one cpu per task
data = data.map(lambda x: x).saveAsTextFile(result_path)

The getAll() calls can be removed; they are only there to check the current configuration.

Changing PySpark configuration at runtime