Subsetting an RDD in Spark-Python

Time: 2015-04-24 16:43:17

Tags: python apache-spark

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx = [0, 4, 5, 6, 8], I would like to be able to get a new RDD containing elements 0, 4, 5, 6 and 8.

Note that I am not interested in the random sampling that is already available.

1 Answer:

Answer 0 (score: 2)

Yes, you can either:

  1. Key your RDD by the value you filter on, put the filtering values in another RDD keyed the same way, then do an outer join to merge them, keeping only the elements whose keys appear in the set (see the code sample at the end).
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that checks that each point exists in your set (see the sketch below).

Choose 1 if the list of values is large, else 2.
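
Since the question is tagged python, here is a minimal PySpark sketch of option 2. The names (sc, parsed_data) and the use of zipWithIndex to attach positional indices are illustrative assumptions, not part of the original answer:

    from pyspark import SparkContext

    sc = SparkContext(appName="subset-by-index")  # hypothetical setup

    # parsed_data stands in for your RDD of LabeledPoints.
    parsed_data = sc.parallelize(["p0", "p1", "p2", "p3", "p4",
                                  "p5", "p6", "p7", "p8"])

    idx = [0, 4, 5, 6, 8]

    # Broadcast the index set so each executor gets one read-only copy.
    idx_bc = sc.broadcast(set(idx))

    subset = (parsed_data
              .zipWithIndex()                                # (element, position)
              .filter(lambda pair: pair[1] in idx_bc.value)  # keep wanted positions
              .map(lambda pair: pair[0]))                    # drop the index again

    print(subset.collect())  # ['p0', 'p4', 'p5', 'p6', 'p8']

Broadcasting the set keeps the membership test local to each executor, which is why this approach wins when the list of indices is small.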


Edit: here is a code sample for case 1 (in Scala):

// Read the list of filtering values the same way you read your points,
// then key each value by itself. valuesRdd is a placeholder for that RDD.
val filteringValues = valuesRdd.keyBy(identity)

val filtered = parsedData
  .keyBy(_.something)              // key by the number in your inner structure (placeholder field)
  .rightOuterJoin(filteringValues) // keeps exactly the keys in your subset
  .flatMap(x => x._2._1)           // unwrap the Option, back to the original type
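
For completeness, a sketch of the same join-based approach in PySpark, reusing the hypothetical sc and parsed_data from the earlier sketch (an inner join plays the role of the outer join + unwrap here):

    idx = [0, 4, 5, 6, 8]
    filtering_values = sc.parallelize(idx).map(lambda i: (i, None))

    filtered = (parsed_data
                .zipWithIndex()                        # (element, position)
                .map(lambda pair: (pair[1], pair[0]))  # key by position
                .join(filtering_values)                # inner join keeps only idx keys
                .map(lambda kv: kv[1][0]))             # back to the original elements

Because both sides are RDDs, this variant scales to index lists that would be too big to broadcast.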