Error using the collect_list function with a struct data type in Spark 1.6.0

Date: 2017-11-23 09:46:57

Tags: pyspark

I get an error in Spark 1.6.0 when executing the following statements. The grouped_df line does not work for me.

from pyspark.sql import functions as F
from pyspark.sql import SQLContext

data = [[1, '2014-01-03', 10], [1, '2014-01-04', 5], [1, '2014-01-05', 15],
        [1, '2014-01-06', 20], [2, '2014-02-10', 100], [2, '2014-03-11', 500],
        [2, '2014-04-15', 1500]]
df = sc.parallelize(data).toDF(['id', 'date', 'value'])
df.show()
# This line raises the AnalysisException shown below:
grouped_df = df.groupby("id").agg(F.collect_list(F.struct("date", "value")).alias("list_col"))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/group.py", line 91, in agg
    _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
  File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/taxgard/CPWorkArea/agarwal/python/spark/spark-1.6/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but struct<date:string,value:bigint> was passed as parameter 1..;'

1 Answer:

Answer 0 (score: 1)

You have to use HiveContext instead of SQLContext. As the traceback shows, collect_list in Spark 1.6 is backed by Hive's GenericUDAFCollectList, so it needs the Hive support that HiveContext provides:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext(appName='my app name')
sql_cntx = HiveContext(sc)

data = [[1, '2014-01-03', 10], [1, '2014-01-04', 5], [1, '2014-01-05', 15],
        [1, '2014-01-06', 20], [2, '2014-02-10', 100], [2, '2014-03-11', 500],
        [2, '2014-04-15', 1500]]

rdd = sc.parallelize(data)
df = sql_cntx.createDataFrame(rdd, ['id', 'date', 'value'])
# ...
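
With the HiveContext-backed DataFrame in place, the aggregation should go through. As a minimal sketch of the remaining step, this simply reuses the grouped_df line from the question (grouped_df and list_col are the names from the original snippet):

# collect_list over a struct column now resolves under HiveContext
grouped_df = df.groupby("id").agg(
    F.collect_list(F.struct("date", "value")).alias("list_col"))
grouped_df.show(truncate=False)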