Filter in Apache Spark not working as expected

Asked: 2016-06-02 00:36:25

Tags: python apache-spark pyspark

I am running the following code:

ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
orderItemsRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")

ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] in "CANCELED").map(lambda rec: (int(rec.split(",")[0]), rec))
orderItemsParsedRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4])))
orderItemsAgg = orderItemsParsedRDD.reduceByKey(lambda acc, value: (acc + value))

ordersJoinOrderItems = orderItemsAgg.join(ordersParsedRDD)

for i in ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).take(5): print(i)

The final result is never displayed; the output simply stops at this point.

Before the join command I was able to display all the records; after the join, no data is displayed. The last lines of output look like this:

16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 67, boot = -57, init = 61, finish = 63
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 73, boot = -68, init = 75, finish = 66
16/06/01 17:26:05 INFO executor.Executor: Finished task 11.0 in stage 289.0 (TID 2689). 1461 bytes result sent to driver
16/06/01 17:26:05 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 289.0 (TID 2689) in 153 ms on localhost (12/24)

1 Answer:

Answer 0 (score: 0)

I would first print a count on ordersJoinOrderItems. If it is greater than 0, that suggests your final filter is the culprit.
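
A minimal sketch of that diagnostic, reusing the RDD names from the question (the exact-match comparison at the end is an assumption about the intended status filter, not part of the original code):

# > 0 means the join is fine and the final `rec[1][0] >= 1000`
# filter is what is discarding everything.
print(ordersJoinOrderItems.count())

# Counting each side of the join is also worthwhile; a 0 here means the
# status filter or the key extraction never matched anything.
print(ordersParsedRDD.count())
print(orderItemsAgg.count())

# Note that `rec.split(",")[3] in "CANCELED"` is a substring test: it also
# accepts values such as "CANCEL" or an empty field. An exact comparison
# (assumed to be the intent) is safer:
ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] == "CANCELED") \
                           .map(lambda rec: (int(rec.split(",")[0]), rec))

If ordersJoinOrderItems.count() returns 0 while both inputs are non-empty, the keys on the two sides do not line up; if it is non-zero, temporarily lowering the 1000 threshold will confirm that the final filter alone is removing the rows.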