Question

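For example, I have a Spark cluster with 6 slaves (each slave has 4 CPUs). I have thousands of files that should be processed by the slaves. The files are located on an FTP server.

Since I have 24 CPU cores in total, how do I assign tasks to each slave?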

Here is the pseudocode workflow (in Python):

Get the list of all file names, all_files_list, and all_files_length = len(all_files_list)

Define the file-processing function

# This function is meant to be executed by each slave:
# it downloads the files in files_list from FTP and processes them on that slave.
def file_process(files_list, slave):
    files = download_files_from_ftp(files_list)
    process_file_list_by_slave_at_cpu(files, slave)

Call the file-processing function in the Spark driver

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
num_slaves = 6
chunk = all_files_length // num_slaves  # integer division so the slice indices are ints
for slave in range(num_slaves):
    index_begin = chunk * slave      # start index of this slave's slice
    index_end = chunk * (slave + 1)  # end index of this slave's slice (exclusive)
    files_list = all_files_list[index_begin:index_end]  # files meant for the CPUs of this slave
    files_list_rdd = sc.parallelize(files_list)  # create an RDD from the slice
    file_process(files_list_rdd, slave)  # call the file-processing function defined above

Any help with implementing this logic on a Spark cluster?

Answer 1

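First, forget about slaves and CPUs in your Spark code. That is the cluster manager's responsibility: it schedules tasks, reschedules them if they fail, and so on.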

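Second, I don't think it is even possible to create 6 RDDs and have each one processed in parallel by a single slave. (You can, however, create one RDD containing 6 files and then process it.)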
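The single-RDD approach could look roughly like the sketch below. It assumes an anonymous FTP login, a hypothetical host name `ftp.example.com`, and a placeholder `process_file` that stands in for the real work; none of these come from the question. The driver only hands Spark the list of file names, and Spark decides which executor runs which partition.

```python
import ftplib
import io

FTP_HOST = "ftp.example.com"  # hypothetical host, replace with the real server


def open_ftp():
    """Open an FTP connection (anonymous login is an assumption)."""
    ftp = ftplib.FTP(FTP_HOST)
    ftp.login()
    return ftp


def fetch_from_ftp(ftp, name):
    """Download one file over an open connection and return its bytes."""
    buf = io.BytesIO()
    ftp.retrbinary("RETR " + name, buf.write)
    return buf.getvalue()


def process_file(name, data):
    """Placeholder for the real per-file work; here it only reports the size."""
    return (name, len(data))


def process_partition(names, connect=open_ftp, fetch=fetch_from_ftp):
    """Runs on an executor: one connection per partition, one result per file."""
    ftp = connect()
    try:
        for name in names:
            yield process_file(name, fetch(ftp, name))
    finally:
        ftp.close()


def run_on_cluster(sc, all_files, num_partitions=24):
    """Driver side: a single RDD of file names, split into num_partitions tasks."""
    rdd = sc.parallelize(all_files, num_partitions)
    return rdd.mapPartitions(process_partition).collect()
```

With an existing SparkContext `sc`, calling `run_on_cluster(sc, all_files_list)` would return one `(name, size)` pair per file. The default of 24 partitions just mirrors the 24 cores in the question; using more partitions than cores usually balances the load better when file sizes vary.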

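Third, if what you essentially want is to process each of N files with 4 cores, then you need to write a Spark app that processes a single file, limit each application to 4 cores by setting spark.cores.max to 4, and finally submit N applications to your cluster with a bash / python / whatever script. That way your cluster will run 6 applications at the same time (24 cores / 4 cores per application), but you cannot be sure that the 4 cores of one application are on the same machine.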
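A submission script for this approach might be sketched as follows. The master URL `spark://master:7077` and the application script `process_one_file.py` are hypothetical names used only for illustration; `spark.cores.max` is passed through spark-submit's `--conf` option. The thread pool keeps at most 6 submissions in flight, which matches the number of 4-core applications that fit on 24 cores.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MASTER_URL = "spark://master:7077"  # hypothetical standalone master URL
APP_SCRIPT = "process_one_file.py"  # hypothetical app that processes one file


def build_submit_command(file_name, cores=4):
    """Build the spark-submit command line for a single file."""
    return [
        "spark-submit",
        "--master", MASTER_URL,
        "--conf", "spark.cores.max={}".format(cores),
        APP_SCRIPT,
        file_name,
    ]


def submit_all(file_names, max_parallel=6):
    """Submit one application per file, at most max_parallel at a time.

    Returns the exit code of each spark-submit call, in input order.
    """
    def submit(name):
        return subprocess.call(build_submit_command(name))

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(submit, file_names))
```

Calling `submit_all(all_files_list)` would then start the applications in batches; a non-zero exit code in the returned list marks a file whose application failed.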
