Py4JJavaError in PySpark

Date: 2018-02-05 16:38:20

Tags: python-2.7 apache-spark pyspark

I am working with Spark through the Python API. Below is my code. When I execute wordCount.first(), I receive ValueError: need more than 1 value to unpack. Any information about this error would be appreciated. Thanks...

#create an RDD with textFile method
text_data_file=sc.textFile('/resources/yelp_labelled.txt')

#import the required library for word count operation
from operator import add
#Use filter to return RDD for words length greater than zero
wordCountFilter=text_data_file.filter(lambda x:len(x)>0)
#use flat map to split each line into words
wordFlatMap=wordCountFilter.flatMap(lambda x: x.split())
#pair each word with the value 5 using the map function
wordMapper=wordFlatMap.flatMap(lambda x:(x,5))
#Use reducebykey function to reduce above mapped keys
#returns the key-value pairs by adding values for similar keys
wordCount=wordMapper.reduceByKey(add)
#view the first element
wordCount.first()
File "/home/notebook/spark-1.6.0-bin-`hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues for k, v in iterator: ValueError: need more than 1 value to unpack`

1 Answer:

Answer 0 (score: 1):

Your error is here:

wordMapper=wordFlatMap.flatMap(lambda x:(x,5))

should be

wordMapper=wordFlatMap.map(lambda x:(x,5))

Otherwise you just emit

x

and

5

as separate values. Spark will then try to unpack x as a (key, value) pair and fail whenever its length is not equal to 2, and it will try to unpack 5 and fail as well.
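
To see the difference concretely, here is a minimal sketch, assuming a live SparkContext sc; the sample words are made up:

#sample words standing in for the file contents
words = sc.parallelize(['spark', 'fast', 'spark'])

#map wraps each word in a single (word, 5) tuple
pairs = words.map(lambda x:(x,5))
#pairs.collect() -> [('spark', 5), ('fast', 5), ('spark', 5)]

#flatMap flattens the tuple, so the word and the 5 become separate elements
flat = words.flatMap(lambda x:(x,5))
#flat.collect() -> ['spark', 5, 'fast', 5, 'spark', 5]

#reduceByKey can unpack every element of pairs as (key, value)
from operator import add
pairs.reduceByKey(add).collect()
#-> [('fast', 5), ('spark', 10)]

#on flat, the internal "for k, v in iterator" hits bare strings and ints
#and raises the unpacking ValueError shown in the question

Note that a two-character word would happen to unpack into two single characters and be treated as a key-value pair, so with flatMap the job can fail in different ways depending on the data.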