How can I force RDD map to process one row at a time, instead of sometimes processing two rows at once?
book = []

def customer_order_agg(row):
    book.append(row['order_number'])
    return book

sample2 = df.rdd.map(customer_order_agg)
print(sample2.take(5))
This is the result I get:
[[721], [721, 722, 723], [721, 722, 723], [721, 722, 723, 724, 725], [721, 722, 723, 724, 725]]
This is what I want:
[[721], [721, 722], [721, 722, 723], [721, 722, 723, 724], [721, 722, 723, 724, 725]]
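One plausible explanation for the difference: the function returns the same `book` list object for every row, and PySpark, as far as I understand its default serializer, pickles results in batches whose size grows (1, 2, 4, ...). Every entry in a batch then shows the list as it was when that batch was pickled. The sketch below is plain Python, not Spark's actual code; `serialize_in_growing_batches` is a hypothetical helper that only imitates that batching, and plain dicts stand in for the rows.

```python
import pickle
from itertools import islice

def serialize_in_growing_batches(results):
    """Pickle results in batches of size 1, 2, 4, ... (simplified stand-in for batched serialization)."""
    it = iter(results)
    batch_size = 1
    out = []
    while True:
        batch = list(islice(it, batch_size))  # pulls batch_size results, running the map function
        if not batch:
            break
        # The snapshot of each list is only taken here, once per batch.
        out.extend(pickle.loads(pickle.dumps(batch)))
        batch_size *= 2
    return out

book = []

def customer_order_agg(row):
    book.append(row['order_number'])
    return book  # the same list object is returned for every row

# Plain dicts stand in for Spark Row objects.
rows = [{'order_number': n} for n in range(721, 726)]
simulated = serialize_in_growing_batches(map(customer_order_agg, rows))
print(simulated)
```

Under these assumptions the simulation gives the same pattern as the result I get from Spark, which suggests the rows are still visited one at a time and only the snapshots of the shared list are taken per batch.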
Contents of my df:
2019-02-27 01:21:49.839392|1|1|136.14|20000.0|0.0|20000.0|0|721|retretre|
2019-02-27 01:21:49.839392|1|1|135.0|3000.0|0.0|3000.0|0|722|tetr|
2019-02-27 01:21:49.839392|1|1|135.0|70000.0|0.0|70000.0|0|723|retete|
2019-02-27 01:21:49.839392|1|1|135.0|1000.0|0.0|1000.0|0|724|etrertert|
2019-02-27 01:21:49.839392|1|1|135.0|200000.0|0.0|200000.0|0|725|00tertL|
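A minimal sketch of one way to get the running lists I want, assuming all rows sit in a single partition and arrive in the desired order. `running_orders` is a hypothetical helper name; it keeps its state local and yields a copy of the list for each row, so later appends cannot change results that were already produced. Plain dicts stand in for Spark `Row` objects so the example runs without a cluster.

```python
def running_orders(rows):
    """Yield, for each row, a fresh list of all order numbers seen so far."""
    seen = []
    for row in rows:
        seen.append(row['order_number'])
        yield list(seen)  # copy, so each result is independent of later appends

# Plain dicts stand in for Spark Row objects here.
rows = [{'order_number': n} for n in (721, 722, 723, 724, 725)]
result = list(running_orders(rows))
print(result)
```

With Spark, a generator like this could be passed to `df.rdd.mapPartitions(running_orders)`. Each partition would start its own list, so this only matches the desired output when the data is in one ordered partition. A window with `collect_list` over rows ordered by `order_number` is probably the more robust route on larger data.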