pySpark gets stuck and doesn't do anything

Asked: 2016-08-10 10:31:30

Tags: hadoop apache-spark

I am trying to connect to a Hadoop cluster from my local machine, and the first thing I do is set these variables:

export HADOOP_USER_NAME=hdfs
export HADOOP_CONF_DIR=yarnconfig
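
To make sure these variables are actually visible to the Spark driver, a quick sanity check from inside the pyspark shell is (a trivial sketch; the expected values are just the ones exported above):

import os

# Sanity check: confirm the exported variables reached the Python driver
print(os.environ.get("HADOOP_USER_NAME"))  # expect: hdfs
print(os.environ.get("HADOOP_CONF_DIR"))   # expect: yarnconfig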

In yarnconfig I have the following in yarn-site.xml:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>{Hadoop_Cluster_IP}</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8050</value>
    </property>
</configuration>

Here {Hadoop_Cluster_IP} is a placeholder for the IP address of the Hadoop cluster I am trying to connect to, which I am not showing for security reasons.
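
To confirm that Spark actually picks this configuration up, the resolved values can be read back from inside the pyspark shell. This is only a debugging sketch: sc._jsc is an internal handle to the Java SparkContext, not a public API.

# sc is the SparkContext created by ./bin/pyspark; _jsc is internal
hconf = sc._jsc.hadoopConfiguration()
print(hconf.get("yarn.resourcemanager.address"))  # expect {Hadoop_Cluster_IP}:8050
print(hconf.get("fs.defaultFS"))                  # the default filesystem URI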

I need to access a bunch of files on this cluster, and I test my script by running an interactive session (./bin/pyspark) and doing the following:

raw_logs = sc.textFile("hdfs://{Hadoop_Cluster_IP}:{Hadoop_Cluster_PORT}/logs/20*/*/*")

where {Hadoop_Cluster_IP} and {Hadoop_Cluster_PORT} are placeholders for the Hadoop IP address and port number.
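
For reference, two hypothetical variants of the same read: if core-site.xml inside HADOOP_CONF_DIR sets fs.defaultFS to this cluster's NameNode, the scheme and host can be dropped from the path, and textFile's optional minPartitions argument raises the lower bound on the number of splits (it does not move the computation closer to the data):

# Variant 1: relative URI (assumes fs.defaultFS points at the cluster)
raw_logs = sc.textFile("/logs/20*/*/*")

# Variant 2: explicit URI, requesting at least 108 splits
# (108 is an arbitrary illustrative value, 2x the 54 seen in the log below)
raw_logs = sc.textFile("hdfs://{Hadoop_Cluster_IP}:{Hadoop_Cluster_PORT}/logs/20*/*/*",
                       minPartitions=108)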

To check that the connection has been established correctly, I do the following:

raw_logs.count()

and I get the following:

16/08/10 12:17:26 INFO SparkContext: Starting job: count at <stdin>:1
16/08/10 12:17:26 INFO DAGScheduler: Got job 2 (count at <stdin>:1) with 54 output partitions
16/08/10 12:17:26 INFO DAGScheduler: Final stage: ResultStage 3(count at <stdin>:1)
16/08/10 12:17:26 INFO DAGScheduler: Parents of final stage: List()
16/08/10 12:17:26 INFO DAGScheduler: Missing parents: List()
16/08/10 12:17:26 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[10] at count at <stdin>:1), which has no missing parents
16/08/10 12:17:26 INFO MemoryStore: ensureFreeSpace(7864) called with curMem=253695, maxMem=555755765
16/08/10 12:17:26 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.7 KB, free 529.8 MB)
16/08/10 12:17:26 INFO MemoryStore: ensureFreeSpace(4990) called with curMem=261559, maxMem=555755765
16/08/10 12:17:26 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.9 KB, free 529.8 MB)
16/08/10 12:17:26 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:60070 (size: 4.9 KB, free: 530.0 MB)
16/08/10 12:17:26 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:861
16/08/10 12:17:26 INFO DAGScheduler: Submitting 54 missing tasks from ResultStage 3 (PythonRDD[10] at count at <stdin>:1)
16/08/10 12:17:26 INFO TaskSchedulerImpl: Adding task set 3.0 with 54 tasks
16/08/10 12:20:44 INFO PythonRunner: Times: total = 951563, boot = 3, init = 9691, finish = 941869
16/08/10 12:20:44 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2125 bytes result sent to driver
16/08/10 12:20:44 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, ANY, 2224 bytes)
16/08/10 12:20:44 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
16/08/10 12:20:44 INFO HadoopRDD: Input split: hdfs://_.json:134217728+134217728
16/08/10 12:20:44 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 955006 ms on localhost (1/54)

There are only 24 files in the folder I am looking at (close to 4 GB in total), yet as you can see it takes minutes for the job even to get going. (The 54 tasks correspond to the HDFS input splits: the files are read in 128 MB blocks, 134217728 bytes in the input-split line above, so the 24 files break down into 54 splits, hence 54 tasks.)
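
For what it's worth, a lighter-weight way to separate "the connection is broken" from "the full scan is just slow" is to avoid count(), which reads all ~4 GB. A sketch reusing the raw_logs RDD from above:

# Computing the splits only talks to the NameNode; no file data is read
print(raw_logs.getNumPartitions())  # 54, per the log above

# Reading a single record touches one split instead of all 54
print(raw_logs.take(1))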

Do you know why pySpark is so slow?

0 Answers:

No answers