I am running the following code:
ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
orderItemsRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")
ordersParsedRDD = ordersRDD.filter(lambda rec: rec.split(",")[3] == "CANCELED").map(lambda rec: (int(rec.split(",")[0]), rec))
orderItemsParsedRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4])))
orderItemsAgg = orderItemsParsedRDD.reduceByKey(lambda acc, value: (acc + value))
ordersJoinOrderItems = orderItemsAgg.join(ordersParsedRDD)
for i in ordersJoinOrderItems.filter(lambda rec: rec[1][0] >= 1000).take(5): print(i)
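The final result is never displayed; the output just stops at this point.
Before the join command I can display all the records, but after the join no data is shown.
The error looks like this: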
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
16/06/01 17:26:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 67, boot = -57, init = 61, finish = 63
16/06/01 17:26:05 INFO python.PythonRunner: Times: total = 73, boot = -68, init = 75, finish = 66
16/06/01 17:26:05 INFO executor.Executor: Finished task 11.0 in stage 289.0 (TID 2689). 1461 bytes result sent to driver
16/06/01 17:26:05 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 289.0 (TID 2689) in 153 ms on localhost (12/24)
Answer 0 (score: 0)
I would first print the count of ordersJoinOrderItems. If it is greater than 0, that suggests your filter is the culprit.
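A minimal sketch of that check, assuming the joined RDD holds (order_id, (revenue, order_line)) pairs as in the question. The helper name diagnose_join and its threshold parameter are made up for illustration; it only relies on the standard RDD methods count, filter, map and max.

```python
def diagnose_join(joined_rdd, threshold=1000):
    """Count joined records and how many of them pass the revenue filter.

    joined_rdd is expected to hold (order_id, (revenue, order_line)) pairs,
    the shape produced by orderItemsAgg.join(ordersParsedRDD).
    """
    total = joined_rdd.count()
    if total == 0:
        # The join itself matched nothing, so the filter is not the problem.
        return {"joined": 0, "passing": 0, "max_revenue": None}
    # Same predicate as the filter in the question.
    passing = joined_rdd.filter(lambda rec: rec[1][0] >= threshold).count()
    # The largest revenue shows whether any record could ever reach the threshold.
    max_revenue = joined_rdd.map(lambda rec: rec[1][0]).max()
    return {"joined": total, "passing": passing, "max_revenue": max_revenue}

# In the pyspark shell from the question:
# print(diagnose_join(ordersJoinOrderItems))
```

If "joined" is positive but "passing" is 0, no canceled order reaches 1000 in revenue and take(5) simply has nothing to return; if "joined" is 0, look at the keys on both sides of the join instead.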