Question

auto = sc.textFile("temp/auto_data.csv")
auto = auto.map(lambda x: x.split(","))
header = auto.first()
autoData = auto.filter(lambda a: a!=header)

Now I have the data in autoData:

[[u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK   ', u'2WH   ', u'KIT', u'KT-2069CZ', u'18', u'8484'], [u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK   ', u'2WH   ', u'KIT', u'KT-2069SZ', u'9', u'5211']]

Now I want to perform groupBy() on the 2nd and 12th (last) values. How can I do that?

Answer 1

groupBy takes a function that generates the key as its argument, so you can do the following:

autoData.groupBy(lambda row: (row[2], row[12]))
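To see what that key function does to the two sample rows from the question, here is a minimal plain-Python sketch that runs without Spark. `local_group_by` is a hypothetical stand-in written for illustration, not part of PySpark; it only mimics the idea of "key function in, groups of whole rows out".

```python
from collections import defaultdict

# The two sample rows from the question (13 fields each, indices 0..12).
rows = [
    ['', 'ETZ', 'AS1', 'CUT000021', 'THE TU-WHEEL SPARES', 'DIBRUGARH',
     '201505', 'LCK   ', '2WH   ', 'KIT', 'KT-2069CZ', '18', '8484'],
    ['', 'ETZ', 'AS1', 'CUT000021', 'THE TU-WHEEL SPARES', 'DIBRUGARH',
     '201505', 'LCK   ', '2WH   ', 'KIT', 'KT-2069SZ', '9', '5211'],
]


def local_group_by(records, key_func):
    """Hypothetical local stand-in for RDD.groupBy: maps key -> list of records."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    return dict(groups)


# Same key function as in the answer: a tuple of two fields.
grouped = local_group_by(rows, lambda row: (row[2], row[12]))

for key, members in grouped.items():
    print(key, len(members))
```

Because the last field differs between the two rows, each row ends up in its own group. The groups hold the full rows untouched; nothing is summed or counted.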

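Edit:

Regarding the task you've described in the comments: groupBy only collects the data into groups, it does not aggregate it. To sum the last field per key, map each row to a (key, value) pair and use reduceByKey: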

from operator import add

def int_or_zero(s):
    try:
        return int(s)
    except ValueError:
        return 0

autoData.map(lambda row: (row[2], int_or_zero(row[12]))).reduceByKey(add)

A very inefficient version using groupBy could look like this:

(autoData
    .map(lambda row: (row[2], int_or_zero(row[12])))
    .groupByKey()
    .mapValues(sum))
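Both approaches give the same per-key totals; they differ in how much data is held and moved. In Spark, reduceByKey combines values within each partition before the shuffle, while groupByKey ships every value across the network and only then sums. The following plain-Python sketch emulates the two strategies locally. `local_reduce_by_key` and `local_group_by_key_then_sum` are hypothetical helpers, and the `('ZZ9', '')` pair is made-up data added only to exercise `int_or_zero`.

```python
from collections import defaultdict
from operator import add


def int_or_zero(s):
    """Parse an integer, falling back to 0 for non-numeric strings."""
    try:
        return int(s)
    except ValueError:
        return 0


# (row[2], row[12]) for the two sample rows, plus one made-up pair.
raw_pairs = [('AS1', '8484'), ('AS1', '5211'), ('ZZ9', '')]
pairs = [(key, int_or_zero(value)) for key, value in raw_pairs]


def local_reduce_by_key(kv_pairs, func):
    """Fold values into one running result per key (reduceByKey-like)."""
    totals = {}
    for key, value in kv_pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


def local_group_by_key_then_sum(kv_pairs):
    """Collect every value per key first, then sum (groupByKey + mapValues-like)."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


reduced = local_reduce_by_key(pairs, add)
grouped_then_summed = local_group_by_key_then_sum(pairs)
print(reduced)
```

The first helper keeps a single running total per key, whereas the second has to materialize the full list of values for every key before summing, which is the extra cost the answer warns about.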

How to perform groupBy in PySpark?
