I am trying to add a new column to a PySpark DataFrame, derived as the sum of some of the DataFrame's existing columns.
The list new_cols holds the names of the columns to be summed.
big_df = big_df.withColumn('voice volume from 3G-fixated users',sum(big_df[c] for c in new_cols))
This works perfectly in the pyspark shell.
It is part of a larger script that I won't paste here.
But when I submit my file to the cluster with the following arguments:
nohup spark-submit --conf spark.network.timeout=3600s --master=yarn --deploy-mode=cluster --executor-memory 40g --num-executors 8 --executor-cores 8 --driver-memory 20g --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml --conf spark.executor.heartbeatInterval=3600s recombinate_new_spark.py &
it fails with the following error:
Log Type: stdout
Log Upload Time: Mon Jul 24 21:49:21 -0400 2017
Log Length: 7218
Traceback (most recent call last):
File "recombinate_new_spark.py", line 114, in <module>
big_df = big_df.withColumn('3G-fixated voice users',sum(big_df[c] for c in new_cols))
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/pyspark.zip/pyspark/sql/functions.py", line 39, in _
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 804, in __call__
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/py4j-0.9-src.zip/py4j/protocol.py", line 278, in get_command_part
AttributeError: 'generator' object has no attribute '_get_object_id'
This is hard to debug because the same code works fine in the pyspark shell, and the submitted job processes exactly the same data on the cluster.
Thanks in advance for any tips and help.
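For what it's worth, the failing call passes a generator to `sum`; if anywhere in the larger script `pyspark.sql.functions` is star-imported, its aggregate `sum` shadows the Python builtin and cannot accept a generator, which would match this `AttributeError`. One way to sidestep that entirely is to fold the columns together with `functools.reduce` over a list. A minimal sketch of the pattern, using a plain dict with hypothetical column names in place of a real DataFrame:

```python
from functools import reduce
from operator import add

# Stand-in for one DataFrame row; in PySpark the equivalent would be
# big_df.withColumn('total', reduce(add, [big_df[c] for c in new_cols]))
row = {'vol_3g': 10, 'vol_4g': 25, 'vol_fixed': 5}  # hypothetical column names
new_cols = ['vol_3g', 'vol_4g', 'vol_fixed']

# reduce(add, [...]) never hands a generator to any (possibly shadowed)
# sum function: it just chains the + operator pairwise over the list.
total = reduce(add, [row[c] for c in new_cols])
print(total)  # 40
```

The same `reduce(add, ...)` expression works on PySpark `Column` objects because they overload `+`, so it behaves identically in the shell and in a submitted script regardless of what `sum` resolves to.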