Question

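In my PySpark application, I have two RDDs:

items - This contains the item ID and item name for all valid items. Approximately 100,000 items.
attributeTable - This contains the fields user ID, item ID and the attribute value of this combination, in that order. These are certain attributes for each user-item combination in the system. This RDD has several hundred thousand rows.

I would like to discard all rows in the attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")

I tried the following approach first, but found that it takes forever to return (on my local Spark installation running in a VM on my PC). Understandably so, because so many comparisons are involved: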

# Create a broadcast variable of all valid item IDs for doing the filter on the executors
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
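One likely contributor to the slowness of the snippet above is that `set(validItemIDs.value)` sits inside the lambda, so the set is rebuilt from the full ID list for every record. The sketch below shows the same membership test with the set built only once. It uses plain Python lists as stand-ins for the RDDs so it can run without Spark; `make_membership_filter`, `attribute_rows` and the sample data are hypothetical names introduced for illustration.

```python
def make_membership_filter(valid_ids, key_index=1):
    """Return a predicate that keeps rows whose key is in valid_ids."""
    valid = frozenset(valid_ids)  # built once, not once per record

    def keep(row):
        return row[key_index] in valid

    return keep


# Plain-Python stand-ins for the two RDDs (hypothetical sample data).
items = [(123, "Item A"), (456, "Item B")]
attribute_rows = [
    (1, 123, "attr for A"),
    (2, 999, "attr for an unknown item"),
    (3, 456, "attr for B"),
]

keep = make_membership_filter(item_id for item_id, _ in items)
kept_rows = [row for row in attribute_rows if keep(row)]
```

In Spark, the equivalent would presumably be to broadcast the set itself, for example `sc.broadcast(set(items.keys().collect()))`, and then filter with `lambda row: row[1] in validItemIDs.value`, so that no per-record conversion remains.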

After a bit of fiddling around, I found that the following approach runs pretty fast (a minute or so on my system).

# Create a broadcast variable for item ID to item name mapping (dictionary) 
itemIdToNameMap = sc.broadcast(items.collectAsMap())

# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
                  .map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
                  .filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
                  .map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))

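While this works well enough for my application, it feels more like a dirty workaround, and I am pretty sure there must be a cleaner or more idiomatic (and possibly more efficient) way to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advice would also be helpful if you could point me to the right resources.

My Spark version is 1.3.1.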

Answer 1

Just do a regular join and then discard the "lookup" relation (in your case, the items RDD).

If these are your RDDs (example taken from another answer):

items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])

Then you'd do:

# Parentheses let the method chain span several lines
(attributeTable.keyBy(lambda x: x[1])
  .join(items)
  .map(lambda (key, (attribute, item)): attribute))

As a result, you only have tuples from the attributeTable RDD that have a corresponding entry in the items RDD:

[(123456, 123, 'Attribute for A')]

Doing it via leftOuterJoin, as suggested in another answer, will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.
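The tuple-unpacking lambda in the snippet above is Python 2 syntax; Python 3 removed tuple parameters. As a minimal sketch, assuming the same record layout, the final map step can be written as a named function that works on both versions. The `inner_join` helper is a plain-Python model of the join semantics, not Spark's implementation, and all names and the extra sample rows here are hypothetical.

```python
def drop_lookup_side(joined):
    """Map (key, (attribute_row, item_name)) back to attribute_row."""
    _key, (attribute_row, _item_name) = joined
    return attribute_row


def inner_join(left_pairs, right_pairs):
    """Plain-Python model of an inner join on (key, value) pairs."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in right_by_key.get(key, [])
    ]


items = [(123, "Item A"), (456, "Item B")]
attribute_rows = [
    (123456, 123, "Attribute for A"),
    (654321, 789, "Attribute for an unknown item"),
]

# Models attributeTable.keyBy(lambda x: x[1])
keyed_rows = [(row[1], row) for row in attribute_rows]
result = [drop_lookup_side(pair) for pair in inner_join(keyed_rows, items)]

# A duplicated item ID on the lookup side duplicates the matching rows.
with_duplicate = inner_join(keyed_rows, items + [(123, "Item A again")])
```

With Spark, the function could be passed as `.map(drop_lookup_side)` after `.join(items)`. Note that the join only behaves as a strict semi-join when item IDs are unique in items; otherwise matching rows are repeated, as `with_duplicate` illustrates.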

Answer 2

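As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and filter functions. Something like the following might suffice: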

items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
       .filter(lambda x: x[1][1] is not None)
       .map(lambda x: (x[0], x[1][0])).collect())

which returns

[(123, 'Item A')]
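The result above keeps rows of items rather than rows of attributeTable. A sketch of the same leftOuterJoin-and-filter idea in the direction the question asks for might look like the following. It models the join in plain Python so it runs without Spark; `left_outer_join` and the second sample row are hypothetical, introduced only for illustration.

```python
def left_outer_join(left_pairs, right_pairs):
    """Plain-Python model of a left outer join on (key, value) pairs."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    joined = []
    for key, left_value in left_pairs:
        # Unmatched left rows are kept and paired with None.
        for right_value in right_by_key.get(key, [None]):
            joined.append((key, (left_value, right_value)))
    return joined


items = [(123, "Item A"), (456, "Item B")]
attribute_rows = [
    (123456, 123, "Attribute for A"),
    (654321, 789, "Attribute for an unknown item"),
]

# attributeTable goes on the left so that its rows are the ones retained.
keyed_rows = [(row[1], row) for row in attribute_rows]
joined = left_outer_join(keyed_rows, items)
semi_joined = [pair[1][0] for pair in joined if pair[1][1] is not None]
```

In PySpark the corresponding chain would presumably be `attributeTable.keyBy(lambda x: x[1]).leftOuterJoin(items).filter(lambda x: x[1][1] is not None).map(lambda x: x[1][0])`. One caveat of filtering on None: an item whose name is itself None would be dropped as well.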

What is the right way to do a semi-join on two Spark RDDs (in PySpark)?
