Replace groupByKey() with reduceByKey()

Date: 2018-08-03 20:56:08

Tags: python apache-spark pyspark pyspark-sql amazon-emr

I am trying to replace groupByKey() with reduceByKey(). I am new to PySpark and Python, and I am having a hard time working out the lambda function for the reduceByKey() operation.

Here is the code:

dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x)).groupByKey(25).take(2)

Here is the return value:

>>> dd
[(u'KEY_1', <pyspark.resultiterable.ResultIterable object at 0x107be0c50>), (u'KEY_2', <pyspark.resultiterable.ResultIterable object at 0x107be0c10>)]

And here is the content of the iterable:

dd[0][1]
Row(key=u'KEY_1', hash_fn=u'deec95d65ca6b3b4f2e1ef259040aa79', value=u'e7dc1f2a')
Row(key=u'KEY_1', hash_fn=u'f8891048a9ef8331227b4af080ecd28a', value=u'fb0bc953')
...
Row(key=u'KEY_1', hash_fn=u'1b9d2bb2db28603ff21052efcd13f242', value=u'd39714d3')
Row(key=u'KEY_1', hash_fn=u'c41b0269706ac423732a6bab24bf8a6a', value=u'ab58db92')

My question is: how can I replace the groupByKey() with a reduceByKey() and get the same output as above?
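
For reference, a common way to reproduce a grouping with reduceByKey() is to wrap each value in a one-element list and concatenate the lists per key. The following is only a minimal sketch, reusing hive_context and orcfile_dir from the question above; note that it yields plain Python lists rather than pyspark.resultiterable.ResultIterable objects:

# Wrap each row in a one-element list, then concatenate the lists per key.
# numPartitions=25 mirrors the groupByKey(25) call in the question.
dd = hive_context.read.orc(orcfile_dir).rdd \
    .map(lambda x: (x[0], [x])) \
    .reduceByKey(lambda a, b: a + b, numPartitions=25) \
    .take(2)

For a pure regrouping like this, reduceByKey() brings no shuffle savings over groupByKey(), since every row still moves across the network; it pays off only when the reduce function actually shrinks the data, for example counting or summing per key.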

0 answers:

No answers