Where does Spark schedule .textFile() tasks?

Time: 2018-04-25 14:25:25

Tags: apache-spark hdfs hadoop-partitioning

Suppose I want to read data from an external HDFS cluster via sc.textFile("hdfs://external_host/file.txt"), and my Spark cluster has 3 workers (one of them may be closer to external_host, but none are on the same host).
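
For concreteness, here is a minimal sketch of the setup I mean (external_host stands in for the external cluster's NameNode; the application name is made up):

```scala
import org.apache.spark.sql.SparkSession

object ExternalHdfsRead {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name; external_host stands in for the
    // NameNode of the external HDFS cluster from the question.
    val spark = SparkSession.builder()
      .appName("external-hdfs-read")
      .getOrCreate()
    val sc = spark.sparkContext

    // textFile builds a HadoopRDD over the remote file; by default
    // each HDFS block maps to one RDD partition.
    val lines = sc.textFile("hdfs://external_host/file.txt")
    println(s"number of partitions: ${lines.getNumPartitions}")

    spark.stop()
  }
}
```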

As far as I understand, Spark schedules tasks based on the locations of the underlying RDDs. But on which worker (i.e. executor) is the .textFile(..) task scheduled, given that we don't run any executors on external_host?
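
One way to see what Spark actually knows here is to print the locality preferences it computed for each partition; this is a small sketch using the public preferredLocations API, reusing sc and the hypothetical path from the sketch above:

```scala
// Print the locality preferences Spark derived for each partition.
// For a HadoopRDD these are the datanode hostnames reported by the
// external cluster's NameNode for the corresponding HDFS block.
val lines = sc.textFile("hdfs://external_host/file.txt")
lines.partitions.foreach { p =>
  val hosts = lines.preferredLocations(p)
  println(s"partition ${p.index} prefers: ${hosts.mkString(", ")}")
}
```

If none of those hostnames match an executor's host, the task cannot be scheduled node-local at all, which is what prompts the question.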

I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark determine which worker is best? (I assume it picks the closest one based on latency or something similar; is that correct?)
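
For reference, these are the locality-related settings I'm aware of, shown with their documented default values. My understanding is that the scheduler matches a partition's preferred hostnames against executor hosts rather than measuring latency, but that is part of what I'm asking:

```scala
// Locality-wait settings (shown with their default values).
// If a partition's preferred hosts match no executor, as with an
// external HDFS cluster, tasks presumably run at locality level ANY
// on whichever executors have free cores, and these waits don't apply.
val spark = SparkSession.builder()
  .appName("locality-config-sketch")  // hypothetical app name
  .config("spark.locality.wait", "3s")
  .config("spark.locality.wait.node", "3s")
  .config("spark.locality.wait.rack", "3s")
  .getOrCreate()
```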

0 Answers:

No answers