Error when using combineByKey()

Time: 2017-07-20 12:46:03

Tags: apache-spark pyspark

joindf.printSchema()
root
 |-- order_customer_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_item_id: string (nullable = true)
 |-- order_item_order_id: string (nullable = true)
 |-- order_item_product_id: string (nullable = true)
 |-- order_item_product_price: string (nullable = true)
 |-- order_item_quantity: string (nullable = true)
 |-- order_item_subtotal: string (nullable = true)



joindf.show(5)
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|order_customer_id|          order_date|order_id|order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|            10153|2013-08-17 00:00:...|    4061|    COMPLETE|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2014-01-12 00:00:...|   27596|     PENDING|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2014-07-18 00:00:...|   56604|      CLOSED|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2013-08-14 00:00:...|   58259|    COMPLETE|        10153|               4080|                  365|                   59.99|                  4|             239.96|
|            10153|2013-08-14 00:00:...|   58269|     PENDING|        10153|               4080|                  365|                   59.99|                  4|             239.96|
+-----------------+--------------------+--------+------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+

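I am using combineByKey() on this RDD to produce a result that gives, for each day and each status, the total number of orders and the total amount. Here is the code: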

 joindf.map(lambda x: ((str(x[1]),str(x[3])),(float(x[9]),int(x[2]))))
 .combineByKey(lambda v: (v[0],set(v[1])) , 
               lambda acc,v: (acc[0]+v[0],v[1].add(acc[1])), 
               lambda acc1,acc2 : (acc1[0]+acc2[0],acc1[1].update(acc2[1])))

This is the error:

TypeError: 'int' object is not iterable

Where am I going wrong? Please help.

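1 Answer:

Answer 0: (score: 0)

You already have a DataFrame, so you do not need to convert it to an RDD to perform this operation.

As far as I understand, you can do the following. The code is in Scala, but you can convert it to Python: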

joindf.groupBy(split($"order_date", " ")(0).as("order_date"))
    .agg(sum($"order_item_quantity"), sum($"order_item_subtotal"))
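
A rough PySpark translation of the same idea might look like the sketch below. It is not the answerer's code: the function name `daily_status_summary` and the output column names are hypothetical, and it makes two assumptions based on the question rather than on the Scala snippet, namely that `order_status` should be part of the grouping and that "total orders" means the number of distinct `order_id` values. Because every column in the schema is a string, the subtotal is cast to double before summing.

```python
def daily_status_summary(joindf):
    """Total orders and total amount per day and status (sketch).

    `joindf` is assumed to be the DataFrame shown in the question.
    """
    # Imported inside the function so the sketch can be loaded
    # even on a machine without Spark installed.
    from pyspark.sql import functions as F

    # Keep only the date part of strings like "2013-08-17 00:00:00.0".
    day = F.split(F.col("order_date"), " ").getItem(0).alias("order_date")

    return (
        joindf
        .groupBy(day, "order_status")
        .agg(
            # Assumption: "total orders" = distinct order ids in the group.
            F.countDistinct("order_id").alias("total_orders"),
            # All columns are strings, so cast before summing.
            F.sum(F.col("order_item_subtotal").cast("double")).alias("total_amount"),
        )
    )
```

Calling `daily_status_summary(joindf).show()` would then print one row per day and status.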

Hope this helps!
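
If the RDD route is kept anyway, the TypeError most likely comes from `set(v[1])`: `v[1]` is a single int (the order id), and `set()` expects an iterable. Two further problems would show up next: `v[1].add(...)` is called on the int instead of on the accumulated set, and both `set.add` and `set.update` return `None`, so the accumulator would lose its set. The following is a minimal sketch of three combiner functions that avoid these issues. The function names are hypothetical, and the pairing of values as `(subtotal, order_id)` is taken from the question's `map` call.

```python
def create_combiner(v):
    """First value seen for a key: start the sum and a one-element set."""
    subtotal, order_id = v
    return (subtotal, {order_id})  # {order_id}, not set(order_id)


def merge_value(acc, v):
    """Fold one more (subtotal, order_id) value into an accumulator."""
    total, order_ids = acc
    subtotal, order_id = v
    order_ids.add(order_id)  # add() mutates in place and returns None
    return (total + subtotal, order_ids)


def merge_combiners(acc1, acc2):
    """Merge the accumulators built on two different partitions."""
    return (acc1[0] + acc2[0], acc1[1] | acc2[1])  # | returns a new set


def orders_and_amount_by_day_status(joindf):
    """Sketch of the question's pipeline using the functions above.

    `joindf` is assumed to be the DataFrame shown in the question.
    """
    pairs = joindf.rdd.map(lambda row: (
        (str(row["order_date"]), str(row["order_status"])),
        (float(row["order_item_subtotal"]), int(row["order_id"])),
    ))
    return (
        pairs
        .combineByKey(create_combiner, merge_value, merge_combiners)
        # (total amount, set of ids) -> (number of orders, total amount)
        .mapValues(lambda acc: (len(acc[1]), acc[0]))
    )
```

The sketch goes through `joindf.rdd` because newer Spark versions no longer expose `map` directly on a DataFrame.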