How do I combine two RDD[String]s index-wise?

Asked: 2017-12-08 03:31:39

Tags: scala apache-spark

I'm working with Spark RDDs and have created two equal-length collections, one holding the hour of each tweet and the other the tweet text. I'd like to combine these into a single data structure (perhaps a tuple?) that I can filter on both the hour and the text of a tweet, but I've been struggling to work out how to do this.

scala> val split_time = split_date.map(line => line.split(":")).map(word => (word(0)))
split_time: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at map at <console>:31

scala> split_time.take(10)
res8: Array[String] = Array(17, 17, 17, 17, 17, 17, 17, 17, 17, 17)


scala> val split_text = text.map(line => line.split(":")).map(word => (word(1).toLowerCase))
split_text: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at map at <console>:29

scala> split_text.take(10)
res0: Array[String] = Array("add @joemontana to this pic and you've got something #nfl https, "are you looking for photo editor, "#colts frank gore needs 27 rushing yards to pass jerome bettis and 49 yards to pass ladainian tomlinson to move int… https, "rt @nflstreamfree, ...

// combine into tuple
val tweet_tuple = (split_time, split_text)

For example, I'd like all tweets from hour 17 that mention "colts":

tweet_tuple.filter(tup => tup._1 == 17 && tup._2.toString.matches("colts"))

<console>:40: error: value filter is not a member of (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String])
          tweet_tuple.filter(tup => tup._1 == 17 && tup._2.toString.matches("colts"))

2 Answers:

Answer 0 (score: 4):

You should use .zip to combine the two RDDs into an RDD[(String, String)].

For example, I created two RDDs:

val split_time = sparkContext.parallelize(Array("17", "17", "17", "17", "17", "17", "17", "17", "17", "17"))
val split_text = sparkContext.parallelize(Array("17", "17", "17", "17", "colts", "17", "17", "colts", "17", "17"))

zip merges the above RDDs into an RDD[Tuple2[String, String]]:

val tweet_tuple = split_time.zip(split_text)

After combining, all you need to do is apply .filter:

tweet_tuple.filter(line => line._1 == "17" && line._2.toString.matches("colts"))

The output should be:

(17,colts)
(17,colts)
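
As a usage note (a stylistic alternative, not part of the original answer), the same filter reads more clearly with a pattern match over the tuple:

tweet_tuple.filter { case (hour, text) => hour == "17" && text.contains("colts") }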

Update

Since your split_text RDD is a collection of sentences, you should use contains instead of matches. So after you have zipped, the following logic should work:

tweet_tuple.filter(line => line._1 == "17" && line._2.toString.contains("colts"))
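
The difference matters because String.matches tests whether the regex matches the entire string, while contains does a plain substring check. A quick illustration (my own example values):

"go colts go".matches("colts")   // false: the regex must match the whole string
"go colts go".contains("colts")  // true: substring test
"colts".matches("colts")         // true only when the text is exactly "colts"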

Answer 1 (score: 1):

The answer by Ramesh Maharjan works only under very specific assumptions:

  • Both RDDs have the same number of partitions.
  • Corresponding partitions have the same number of elements.

This holds trivially for a ParallelCollectionRDD, but in general it is hard or impossible to guarantee.
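
To see the two failure modes, a minimal sketch (my own example; the error messages are paraphrased from Spark):

val a = sparkContext.parallelize(1 to 6, 2)   // 2 partitions
val b = sparkContext.parallelize(1 to 6, 3)   // 3 partitions
a.zip(b).collect()
// java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

val c = sparkContext.parallelize(1 to 6, 2)
val d = c.filter(_ % 2 == 0)                  // same partition count, fewer elements per partition
c.zip(d).collect()
// org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition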

A safer approach is join: index both RDDs with zipWithIndex, swap so the index becomes the key, and join on it. Better, but more expensive:
split_time.zipWithIndex.map(_.swap).join(
  split_text.zipWithIndex.map(_.swap)
).values

Or:

val split_time_with_index = split_time.zipWithIndex.map(_.swap)
val split_text_with_index = split_text.zipWithIndex.map(_.swap)

// The partitioner must be built on a keyed (pair) RDD, so use the indexed form.
val partitioner = new org.apache.spark.RangePartitioner(
  split_time_with_index.getNumPartitions, split_time_with_index
)

split_time_with_index.join(split_text_with_index, partitioner).values
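
Either variant again yields an RDD[(String, String)], so the filter from the accepted answer applies unchanged. One caveat worth adding (my note, not part of the original answer): join does not guarantee the original element order, so if order matters, sort by the index key before dropping it:

val combined = split_time_with_index
  .join(split_text_with_index, partitioner)
  .sortByKey()   // restore the original order by index
  .values

combined.filter { case (hour, text) => hour == "17" && text.contains("colts") }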