Pyspark: add a new column with the row-wise sum of more than 255 columns

Time: 2018-07-06 03:48:35

Tags: pyspark sum arguments python-3.6

I need to compute the row-wise sum of the values in roughly 900 columns. I applied the function from the linked question "Spark - Sum of row values":

from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)

df.withColumn('total', add(*[df[x] for x in df.columns])).show()

But I get this error:

Py4JJavaError: An error occurred while calling o1005.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "***********\pyspark\worker.py", line 218, in main
  File "***********\pyspark\worker.py", line 147, in read_udfs
  File "<string>", line 1
SyntaxError: more than 255 arguments

1 answer:

Answer 0 (score: 2)

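I got the same error with the superSum function, but the code below works. My guess is that a udf cannot be called with more than 255 arguments. Tested on Python 3.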

import operator
from functools import reduce
import findspark
findspark.init() # replace with your spark path
from pyspark import SparkConf, SparkContext

from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql import Row

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)


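# One row with 300 integer columns named "0" .. "299", all set to 0
df = sqlContext.createDataFrame([
    Row(**{str(i):0 for i in range(300)})
])

# Fold the columns into one expression col("0") + col("1") + ... and add it as 'total'
df \
    .withColumn('total', reduce(operator.add, map(F.col, df.columns))).show()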
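
The guess about the 255-argument limit can be checked without Spark. The traceback in the question ends in `File "<string>", line 1`, which suggests the worker evaluates a generated source string that calls the function with one argument per column. Python 3.6 and earlier refuse to compile a call with more than 255 arguments; Python 3.7 removed that limit. The sketch below is only an illustration of that behaviour: `call_compiles` is a hypothetical helper, not part of PySpark, and the generated call `f(a0, a1, ...)` is a stand-in for whatever the worker actually builds.

```python
import operator
import sys
from functools import reduce


def call_compiles(n_args):
    """Return True if a call with n_args positional arguments compiles."""
    args = ", ".join("a{}".format(i) for i in range(n_args))
    source = "f({})".format(args)
    try:
        # Compile only; the names f, a0, a1, ... are never evaluated.
        compile(source, "<string>", "eval")
        return True
    except SyntaxError:
        # Python 3.6 and earlier: "SyntaxError: more than 255 arguments"
        return False


# 255 arguments is within the limit on every Python 3 version.
small_ok = call_compiles(255)
# 300 arguments compiles only on Python 3.7 or later.
large_ok = call_compiles(300)
print(sys.version_info[:2], small_ok, large_ok)

# reduce() receives a single iterable, so the number of items
# never turns into an argument count.
total = reduce(operator.add, range(300))
print(total)
```

This is also why the `reduce(operator.add, map(F.col, df.columns))` version is unaffected: it builds one column expression out of pairwise additions and never makes a single call with hundreds of arguments.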