Question

How do I join two key-value RDDs in Apache Spark using Python?

Here are my two RDDs:

rddUser:[((u'M', '[68-73]', u'B'), u'TwoFace'), ((u'F', '[33-38]', u'Fr'), u'Catwoman'), ((u'Female', '[23-28]', u'L'), u'HarleyQuinn'), ((u'M', '[75+]', u'L'), u'Joker'), ((u'F', '[28-33]', u'Belgium'), u'PoisonIvy')]
rdd:[((u'F', '[23-28]', u'L'), 180.0), ((u'F', '[28-33]', u'B'), 60.0), ((u'F', '[33-38]', u'Fr'), 56.0), ((u'M', '[68-73]', u'B'), 136.0), ((u'M', '[75+]', u'L'), 98.0)]

I tried this:

print rddUser.join(rdd).collect()

But Spark hangs on this line.

Expected result (or something similar):

((u'M', '[68-73]', u'B'), u'TwoFace', 136.0)

How can I do this?

Edit:

It works fine in the pyspark shell, but when I use it in my script, the script hangs on that line. After 30 minutes, this log appears:

17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.2.76:40028 in memory (size: 11.8 KB, free: 366.3 MB)
17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.2.76:40028 in memory (size: 5.8 KB, free: 366.3 MB)
17/04/27 12:25:22 INFO ContextCleaner: Cleaned accumulator 135
17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 192.168.2.76:40028 in memory (size: 10.7 KB, free: 366.3 MB)
17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 192.168.2.76:40028 in memory (size: 399.0 B, free: 366.3 MB)
17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 192.168.2.76:40028 in memory (size: 9.5 KB, free: 366.3 MB)
17/04/27 12:25:22 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 192.168.2.76:40028 in memory (size: 5.0 KB, free: 366.3 MB)

After 1 hour 30 minutes, nothing more has been appended to the log.

Answer 1

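The join works fine in pyspark. However, with your current data it produces results of the form (k, (v1, v2)) rather than the (k, v1, v2) you are expecting. You can apply a map afterwards to reshape the records as needed.

The following works fine for me: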

# sc is the SparkContext that the pyspark shell creates for you
rddUser = sc.parallelize([((u'M', '[68-73]', u'B'), u'TwoFace'),
                          ((u'F', '[33-38]', u'Fr'), u'Catwoman'),
                          ((u'Female', '[23-28]', u'L'), u'HarleyQuinn'),
                          ((u'M', '[75+]', u'L'), u'Joker'),
                          ((u'F', '[28-33]', u'Belgium'), u'PoisonIvy')])

rdd = sc.parallelize([((u'F', '[23-28]', u'L'), 180.0),
                      ((u'F', '[28-33]', u'B'), 60.0),
                      ((u'F', '[33-38]', u'Fr'), 56.0),
                      ((u'M', '[68-73]', u'B'), 136.0),
                      ((u'M', '[75+]', u'L'), 98.0)])

# Inner join on the tuple key; each result is (key, (user, value))
rddUser.join(rdd).collect()

Output:

[(('M', '[68-73]', 'B'), ('TwoFace', 136.0)),
 (('F', '[33-38]', 'Fr'), ('Catwoman', 56.0)),
 (('M', '[75+]', 'L'), ('Joker', 98.0))]
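
To get the flat (k, v1, v2) shape from the question, the map step could look roughly like the sketch below. It is plain Python that imitates the inner-join behaviour of join on the sample data, so it runs without a cluster; the helper names inner_join and flatten are made up for illustration and are not part of Spark. On the real RDDs, the equivalent would presumably be rddUser.join(rdd).map(lambda kv: (kv[0], kv[1][0], kv[1][1])).

```python
# Plain-Python sketch: imitate an inner join on tuple keys, then flatten.
# inner_join and flatten are hypothetical helpers, not Spark APIs.

users = [(('M', '[68-73]', 'B'), 'TwoFace'),
         (('F', '[33-38]', 'Fr'), 'Catwoman'),
         (('Female', '[23-28]', 'L'), 'HarleyQuinn'),
         (('M', '[75+]', 'L'), 'Joker'),
         (('F', '[28-33]', 'Belgium'), 'PoisonIvy')]

values = [(('F', '[23-28]', 'L'), 180.0),
          (('F', '[28-33]', 'B'), 60.0),
          (('F', '[33-38]', 'Fr'), 56.0),
          (('M', '[68-73]', 'B'), 136.0),
          (('M', '[75+]', 'L'), 98.0)]


def inner_join(left, right):
    """Keep only keys present on both sides, as (k, (v1, v2)) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (v1, v2))
            for key, v1 in left
            for v2 in right_by_key.get(key, [])]


def flatten(record):
    """Turn (k, (v1, v2)) into (k, v1, v2)."""
    key, (v1, v2) = record
    return (key, v1, v2)


joined = inner_join(users, values)
flat = [flatten(record) for record in joined]

for row in flat:
    print(row)
```

Only three records survive because the join is an inner join on exact key equality: HarleyQuinn's key uses 'Female' where the other RDD has 'F', and PoisonIvy's key uses 'Belgium' where the other RDD has 'B', so neither finds a match.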
