Suppose I want to read data from an external HDFS, and there are 3 workers in my Spark cluster (one of them possibly closer to external_host, but not on the same host):

```scala
sc.textFile("hdfs://external_host/file.txt")
```
As far as I understand, Spark schedules tasks based on the locations of the underlying RDD's partitions. But on which worker (i.e., executor) does the .textFile(..) read get scheduled, given that we don't
run executors on external_host?
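For reference, here is how I tried to inspect the locality preferences Spark derives from the HDFS block locations. This is just a minimal sketch reusing the hdfs://external_host path from above; RDD.preferredLocations is part of the public Scala API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityInspect {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-inspect"))

    // Each HDFS block of the file becomes one partition of this RDD.
    val rdd = sc.textFile("hdfs://external_host/file.txt")

    // preferredLocations reports the hostnames the scheduler would
    // prefer to run each partition's task on (the HDFS datanodes).
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: preferred = ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```

In my case these preferred hosts are datanodes of the external cluster, where no executors run, which is exactly what prompts the question.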
I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark determine which worker is best? (I assume it picks the closest one based on network latency or something similar; is that correct?)
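As far as I can tell, the only related knobs I've found are the documented spark.locality.wait settings, which control how long the scheduler waits for a slot at each locality level (process-local, then node-local, then rack-local, then any) before falling back to the next one. A sketch of how one might set them, with the documented defaults:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.locality.wait and its per-level variants are documented Spark
// configs; "3s" is the default base wait at each locality level.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "3s")       // base wait for all levels
  .set("spark.locality.wait.node", "3s")  // override for NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")  // override for RACK_LOCAL
val sc = new SparkContext(conf)
```

But these only describe the fallback timing, not how the "closest" worker is chosen in the first place, which is what I'd like to understand.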