I am trying to add a new column to a PySpark DataFrame, derived as the sum of some of the DataFrame's existing columns.
The list new_cols holds the names of the columns to be summed.
big_df = big_df.withColumn('voice volume from 3G-fixated users',sum(big_df[c] for c in new_cols))
This works perfectly in the pyspark shell.
It is part of a larger script that I won't paste here.
But when I submit my file to the cluster with the following arguments:
nohup spark-submit --conf spark.network.timeout=3600s --master=yarn --deploy-mode=cluster --executor-memory 40g --num-executors 8 --executor-cores 8 --driver-memory 20g --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --files /usr/hdp/current/spark-client/conf/hive-site.xml --conf spark.executor.heartbeatInterval=3600s recombinate_new_spark.py &
it fails with the following error:
Log Type: stdout
Log Upload Time: Mon Jul 24 21:49:21 -0400 2017
Log Length: 7218
Traceback (most recent call last):
File "recombinate_new_spark.py", line 114, in <module>
big_df = big_df.withColumn('3G-fixated voice users',sum(big_df[c] for c in new_cols))
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/pyspark.zip/pyspark/sql/functions.py", line 39, in _
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 804, in __call__
File "/opt/data/data04/yarn/local/usercache/vb4320/appcache/application_1500830506843_32979/container_e712_1500830506843_32979_01_000001/py4j-0.9-src.zip/py4j/protocol.py", line 278, in get_command_part
AttributeError: 'generator' object has no attribute '_get_object_id'
This is hard to debug because the same code works fine in the pyspark shell, and the submitted job processes exactly the same data on the cluster.
Thanks in advance for any tips and help.
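For what it's worth, the failing call passes a generator to `sum`; if anywhere in the larger script `pyspark.sql.functions` is star-imported, its aggregate `sum` shadows the Python builtin and cannot accept a generator, which would match this `AttributeError`. One way to sidestep that entirely is to fold the columns together with `functools.reduce` over a list. A minimal sketch of the pattern, using a plain dict with hypothetical column names in place of a real DataFrame:

```python
from functools import reduce
from operator import add

# Stand-in for one DataFrame row; in PySpark the equivalent would be
# big_df.withColumn('total', reduce(add, [big_df[c] for c in new_cols]))
row = {'vol_3g': 10, 'vol_4g': 25, 'vol_fixed': 5}  # hypothetical column names
new_cols = ['vol_3g', 'vol_4g', 'vol_fixed']

# reduce(add, [...]) never hands a generator to any (possibly shadowed)
# sum function: it just chains the + operator pairwise over the list.
total = reduce(add, [row[c] for c in new_cols])
print(total)  # 40
```

The same `reduce(add, ...)` expression works on PySpark `Column` objects because they overload `+`, so it behaves identically in the shell and in a submitted script regardless of what `sum` resolves to.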