I am new to Apache Spark, and I don't know whether I have misunderstood reduceByKey or hit a bug. I am using the spark-1.4.1-bin-hadoop1 build because of a Python Cassandra interface problem in spark-1.4.1-bin-hadoop2.
reduceByKey(lambda x,y: y[0]) returns the first value of the last tuple, but reduceByKey(lambda x,y: x[0]) throws an exception.
I tried reduceByKey(lambda x,y: x[0]+y[0]) to sum the values by key, but that statement throws the same exception as x[0].
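For what it's worth, the same TypeError can be reproduced outside Spark with a plain reduce over three of the mapped value tuples (a minimal sketch; the tuples are copied from the mapped output shown below):

from functools import reduce  # reduce is a builtin in Python 2; the import also works there

# three (m2_rad, m6_rad) value tuples that share one key, taken from the mapped output below
values = [(7469.0, 41252.0), (7444.0, 40811.0), (7446.0, 40778.0)]

# step 1: x = (7469.0, 41252.0), y = (7444.0, 40811.0)  ->  14913.0 (a plain float)
# step 2: x = 14913.0, y = (7446.0, 40778.0)            ->  x[0] raises the TypeError
reduce(lambda x, y: x[0] + y[0], values)
# TypeError: 'float' object has no attribute '__getitem__'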
Code snippet:
import sys
from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
import h5py
import numpy
import os
import datetime
if __name__ == "__main__":
    sc_conf = SparkConf().setAppName("VIIRS_QC").set("spark.default.parallelism", "49").set("spark.storage.memoryFraction", "0.75")
    sc = SparkContext(conf=sc_conf)
    sqlContext = SQLContext(sc)
    f = h5py.File("/mnt/NAS/pmacharr/sample_20130918/GMTCO_npp_d20130919_t0544413_e0546054_b09816_c20130919063740340635_noaa_ops.h5", 'r')
    result = f["/All_Data/VIIRS-MOD-GEO-TC_All/Latitude"]
    myLats = numpy.ravel(result).tolist()
    ...
    t1 = numpy.dstack((myLats, myLons, myArray, myM2_radiance, myDNP))
    t1 = t1.tolist()
    x = sc.parallelize(t1[0][123401:123410])
    print t1[0][123401:123410]
    print "input list=", t1[0][123401:123410]
    y = x.map(
        lambda (lat, lon, m6_rad, m2_rad, dn):
            ((round(lat, 0), round(lon, 0), dn), (m2_rad, m6_rad))
    )
    print "map"
    print y.collect()
    print "reduceByKey(lambda x,y: x)=", y.reduceByKey(lambda x,y: x).collect()
    print "reduceByKey(lambda x,y: y)=", y.reduceByKey(lambda x,y: y).collect()
    print "reduceByKey(lambda x,y: y[0])=", y.reduceByKey(lambda x,y: y[0]).collect()
    print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()
    sc.stop()
    exit()
Output:
./bin/spark-submit --driver-class-path ./lib/spark-examples-1.4.1-hadoop1.0.4.jar ./agg_v.py
input list= [
[12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
[12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
[12.093526840209961, 111.83319091796875, 40778.0, 7446.0, 16.0],
[12.092370986938477, 111.82584381103516, 39389.0, 7352.0, 16.0],
[12.091206550598145, 111.81849670410156, 42592.0, 7602.0, 16.0],
[12.09003734588623, 111.8111343383789, 38572.0, 7328.0, 16.0],
[12.088878631591797, 111.80377960205078, 46203.0, 7939.0, 16.0],
[12.087711334228516, 111.7964096069336, 42690.0, 7608.0, 16.0],
[12.08655071258545, 111.78905487060547, 40942.0, 7478.0, 16.0]
]
map=[
((12.0, 112.0, 16.0), (7469.0, 41252.0)),
((12.0, 112.0, 16.0), (7444.0, 40811.0)),
((12.0, 112.0, 16.0), (7446.0, 40778.0)),
((12.0, 112.0, 16.0), (7352.0, 39389.0)),
((12.0, 112.0, 16.0), (7602.0, 42592.0)),
((12.0, 112.0, 16.0), (7328.0, 38572.0)),
((12.0, 112.0, 16.0), (7939.0, 46203.0)),
((12.0, 112.0, 16.0), (7608.0, 42690.0)),
((12.0, 112.0, 16.0), (7478.0, 40942.0))
]
reduceByKey(lambda x,y: x)= [((12.0, 112.0, 16.0), (7469.0, 41252.0))]
reduceByKey(lambda x,y: y)= [((12.0, 112.0, 16.0), (7478.0, 40942.0))]
reduceByKey(lambda x,y: y[0])= [((12.0, 112.0, 16.0), 7478.0)]
reduceByKey(lambda x,y: x[0])=
15/09/24 12:02:39 ERROR Executor: Exception in task 14.0 in stage 8.0 (TID 406)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/apps/ots/spark-1.4.1-bin-hadoop1/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
...
print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()
TypeError: 'float' object has no attribute '__getitem__'
Answer 0 (score: 0):
Using pyspark:
>>> t1=[
... [12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
... [12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
... ]
>>> t1
[[12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],[12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0]]
>>> x=sc.parallelize(t1)
>>> y2=x.map(lambda (lat, lon, m6_rad, m2_rad, dn):((round(lat,0),round(lon,0),dn), (m6_rad, m2_rad)))
>>> y2.collect()
[((12.0, 112.0, 16.0), (41252.0, 7469.0)), ((12.0, 112.0, 16.0), (40811.0, 7444.0))]
>>> y2.reduceByKey(lambda (x), y: x[0]+y[0]).collect()
[((12.0, 112.0, 16.0), 82063.0)]
>>>
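A note on why this works here: the function passed to reduceByKey must return a value of the same shape as its inputs, because the result of one merge is fed back in as x (or y) for the next merge. lambda x, y: x[0]+y[0] happens to work in this two-row example because it is applied exactly once per key; with three or more values per key (as in the nine-row sample in the question) the float it returns gets fed back in and x[0] raises the TypeError again. A shape-preserving variant that sums both radiance fields at once could look like this (a sketch in the same shell session; the expected output is worked out by hand from the two rows above):

>>> y2.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).collect()
[((12.0, 112.0, 16.0), (82063.0, 14913.0))]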
Or it can be done this way:
>>> y2.reduceByKey(lambda x, y: (x[0]+y[0], 0)).collect()
[((12.0, 112.0, 16.0), (82063.0, 0))]
>>> y2.reduceByKey(lambda x, y: (x[1]+y[1], 0)).collect()
[((12.0, 112.0, 16.0), (14913.0, 0))]
>>>
I'm not sure which of these is the "best" way, but it produces what I was after.
Would it be "better" to implement the map differently?
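One possible alternative, sketched below against the x RDD from the shell session above (y3 and the use of operator.add are names introduced here, not from the original post): if only one radiance needs to be summed per key, the map could emit a scalar value instead of a tuple, and reduceByKey can then use plain addition with no shape concerns.

>>> from operator import add
>>> # keep m6_rad only as the value; the key stays (rounded lat, rounded lon, dn)
>>> y3 = x.map(lambda (lat, lon, m6_rad, m2_rad, dn): ((round(lat, 0), round(lon, 0), dn), m6_rad))
>>> y3.reduceByKey(add).collect()
[((12.0, 112.0, 16.0), 82063.0)]

Whether that is "better" probably depends on whether both radiances are needed downstream.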