Question

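In PySpark, I understand that python workers are used to perform (at least some of) the computation on the worker nodes (as described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals).

In my test setup I'm trying to get Spark to use 4 worker threads (on a standalone machine), but it seems that only 1 python worker is created: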

import threading

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('local[4]')\
    .appName("PythonPi")\
    .getOrCreate()

partitions = 4

# Print the ident of the local thread:
print(str(threading.get_ident()))

# Print the idents of the threads inside the python workers:
thread_ids = spark.sparkContext.parallelize(range(1, partitions + 1), partitions)\
.map(lambda x: ' threadid: ' + str(threading.get_ident())).collect()


print(thread_ids)

spark.stop()

Output:

140226126948096
[' threadid: 139948131018496', ' threadid: 139948131018496', ' threadid: 139948131018496', ' threadid: 139948131018496']

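Looking at these thread IDs, it seems that the same Python thread (in the same worker) was used to process all partitions? Or is that code being evaluated outside the python workers?

Is there some other way to access an ID for the python worker, so that I can understand where the code is running?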

Answer 1

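Your mistake is assuming that PySpark uses threads. It does not. It uses processes, and thread IDs are, in general, unique only within a process (and can be reused).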

So your code should report process IDs instead of thread IDs:

import os

# Reuses `spark` and `partitions` from the question.

# Print the PID of the driver process:
print(os.getpid())

# Print the PIDs of the python worker processes:
pids = spark.sparkContext.parallelize(range(1, partitions + 1), partitions)\
    .map(lambda x: ' pid: ' + str(os.getpid())).collect()

print(pids)
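
The point about thread IDs being meaningful only inside one process can be checked without Spark. The following is a minimal sketch using only the standard library: it starts a few plain Python interpreters (stand-ins for worker processes, not actual PySpark workers) and collects each one's PID and main-thread ident. The names `CHILD_SNIPPET` and `probe_child_processes` are made up for this illustration.

```python
import os
import subprocess
import sys
import threading

# One-liner run in each child interpreter: report its PID and main-thread ident.
CHILD_SNIPPET = "import os, threading; print(os.getpid(), threading.get_ident())"


def probe_child_processes(count=4):
    """Start `count` separate Python processes and collect (pid, thread_ident) pairs."""
    results = []
    for _ in range(count):
        completed = subprocess.run(
            [sys.executable, "-c", CHILD_SNIPPET],
            capture_output=True,
            text=True,
            check=True,
        )
        pid_text, ident_text = completed.stdout.split()
        results.append((int(pid_text), int(ident_text)))
    return results


if __name__ == "__main__":
    print("parent:", os.getpid(), threading.get_ident())
    for pid, ident in probe_child_processes():
        print("child pid:", pid, "thread ident:", ident)
```

Depending on the platform, the thread idents printed for the children may coincide with each other even though every child is a separate process, which is why the PID is the more useful thing to log when trying to tell workers apart.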

Multiple Python workers (or worker threads) in PySpark?