Subsetting an RDD in Spark-Python

Time: 2015-04-24 16:43:17

Tags: python apache-spark

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx = [0, 4, 5, 6, 8], I would like to be able to get a new RDD containing elements 0, 4, 5, 6 and 8.

Note that I am not interested in the random sampling that is already available.

1 Answer:

Answer 0 (score: 2)

Yes, you can either:

  1. Key your RDD by the value you filter on, put the filtering values in another RDD keyed the same way, then do an outer join to merge them, keeping only the elements whose keys appear in the set (see the code sample at the end).
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that checks that each point exists in your set (see the sketch below).

Choose 1 if the list of values is large, else 2.
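
Since the question is tagged python, here is a minimal PySpark sketch of option 2. The names (sc, parsed_data) and the use of zipWithIndex to attach positional indices are illustrative assumptions, not part of the original answer:

    from pyspark import SparkContext

    sc = SparkContext(appName="subset-by-index")  # hypothetical setup

    # parsed_data stands in for your RDD of LabeledPoints.
    parsed_data = sc.parallelize(["p0", "p1", "p2", "p3", "p4",
                                  "p5", "p6", "p7", "p8"])

    idx = [0, 4, 5, 6, 8]

    # Broadcast the index set so each executor gets one read-only copy.
    idx_bc = sc.broadcast(set(idx))

    subset = (parsed_data
              .zipWithIndex()                                # (element, position)
              .filter(lambda pair: pair[1] in idx_bc.value)  # keep wanted positions
              .map(lambda pair: pair[0]))                    # drop the index again

    print(subset.collect())  # ['p0', 'p4', 'p5', 'p6', 'p8']

Broadcasting the set keeps the membership test local to each executor, which is why this approach wins when the list of indices is small.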


Edit: here is a code sample for case 1 (in Scala):

// Read the list of filtering values the same way you read your points,
// then key each value by itself. valuesRdd is a placeholder for that RDD.
val filteringValues = valuesRdd.keyBy(identity)

val filtered = parsedData
  .keyBy(_.something)              // key by the number in your inner structure (placeholder field)
  .rightOuterJoin(filteringValues) // keeps exactly the keys in your subset
  .flatMap(x => x._2._1)           // unwrap the Option, back to the original type
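
For completeness, a sketch of the same join-based approach in PySpark, reusing the hypothetical sc and parsed_data from the earlier sketch (an inner join plays the role of the outer join + unwrap here):

    idx = [0, 4, 5, 6, 8]
    filtering_values = sc.parallelize(idx).map(lambda i: (i, None))

    filtered = (parsed_data
                .zipWithIndex()                        # (element, position)
                .map(lambda pair: (pair[1], pair[0]))  # key by position
                .join(filtering_values)                # inner join keeps only idx keys
                .map(lambda kv: kv[1][0]))             # back to the original elements

Because both sides are RDDs, this variant scales to index lists that would be too big to broadcast.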