I'm new to Scala, and I'm trying to build tuple pairs from an RDD whose elements have type Array(Array[String]), like this:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
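I want to create tuple pairs from these arrays so that the first element of each array is the key and every other element of that array is a value. For example, the output would look like: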
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I don't know how to separate the first element of each array and then map it to the other elements.
Answer 0 (score: 2)
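If I understand correctly, the core of the problem is separating the head (first element) of each inner array from its tail (the remaining elements), which you can do with the head and tail methods. An RDD exposes map, flatMap, and filter much like a Scala list does, so you can do all of this with plain Scala code.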
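As a minimal sketch of the head/tail split on a single row, using only plain Scala collections (no Spark): the names sampleRow, rowKey, rowValues, rowPairs, and the helper keyedPairs are illustrative and not part of the original answer. The helper is there because head and tail throw on an empty array, which may matter if some rows can be empty.

```scala
// Plain Scala collections only; no Spark is needed to try this out.
val sampleRow: Array[String] = Array("122abc", "223cde", "334vbn", "445das")

val rowKey: String = sampleRow.head            // first element of the row
val rowValues: Array[String] = sampleRow.tail  // everything after the first element

// Pair the key with each of the remaining elements.
val rowPairs: Array[(String, String)] = rowValues.map(value => rowKey -> value)

// Hypothetical helper (an assumption, not from the original answer):
// head and tail throw on an empty array, so go through headOption instead.
def keyedPairs(row: Array[String]): Array[(String, String)] =
  row.headOption match {
    case Some(key) => row.drop(1).map(value => key -> value)
    case None      => Array.empty[(String, String)]
  }
```

A row with a single element yields no pairs, since there is nothing to pair the key with.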
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
Seq(
Array(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
)
)
)
The following should do what you want:
val output: RDD[(String,String)] =
input.flatMap { arrArrStr: Array[Array[String]] =>
arrArrStr.flatMap { arrStrs: Array[String] =>
arrStrs.tail.map { value => arrStrs.head -> value }
}
}
In fact, because of the way flatMap and map compose, you can rewrite this as a for-comprehension.
val output: RDD[(String,String)] =
for {
arrArrStr: Array[Array[String]] <- input
arrStr: Array[String] <- arrArrStr
str: String <- arrStr.tail
} yield (arrStr.head -> str)
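Which one you use is ultimately a matter of personal preference (although in this case I prefer the latter, since you don't have to indent the code as much).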
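To make the equivalence concrete, here is a small sketch on plain Scala sequences rather than an RDD (an assumption made so it runs without Spark). The names demoRows, viaFlatMap, and viaFor are illustrative only.

```scala
val demoRows: Seq[Seq[String]] = Seq(
  Seq("122abc", "223cde", "334vbn", "445das"),
  Seq("221bca", "321dsa")
)

// Explicit form: flatMap over the rows, map over each row's tail.
val viaFlatMap: Seq[(String, String)] =
  demoRows.flatMap(row => row.tail.map(value => row.head -> value))

// for-comprehension: the compiler rewrites this into the same flatMap/map chain.
val viaFor: Seq[(String, String)] =
  for {
    row   <- demoRows
    value <- row.tail
  } yield row.head -> value
```

Both values contain the same pairs in the same order, so the choice really is only about readability.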
To verify:
output.collect().foreach(println)
This should print:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
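Answer 1 (score: 1)
This is a classic fold operation. In Spark, a fold whose result type differs from the element type is expressed with aggregate, which takes two functions: one that folds a row into the accumulator within a partition, and one that merges the accumulators of two partitions (data is assumed here to be an RDD[Array[String]]):
// Start with an empty array
data.aggregate(Array.empty[(String, String)])(
// seqOp: `arr.drop(1).map(e => (arr.head, e))` creates tuples of the first
// element and every other element in the row; append them to the accumulator.
(acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e)),
// combOp: concatenate the accumulators of two partitions.
(acc1, acc2) => acc1 ++ acc2
)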
The same solution outside of Spark:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
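For intuition about how a partitioned fold differs from foldLeft, the sketch below imitates it with plain Scala collections: each partition is folded on its own, and the per-partition results are then merged. This is only a local simulation under assumed names (Pair, foldRow, mergeAccs, fakePartitions); Spark distributes this work and may merge partition results in a different order.

```scala
type Pair = (String, String)

// Folds one row into the accumulator of a partition.
def foldRow(acc: Vector[Pair], row: Seq[String]): Vector[Pair] =
  acc ++ row.drop(1).map(value => row.head -> value)

// Merges the accumulators of two partitions.
def mergeAccs(left: Vector[Pair], right: Vector[Pair]): Vector[Pair] =
  left ++ right

// Two hypothetical partitions holding the rows from the question.
val fakePartitions: Seq[Seq[Seq[String]]] = Seq(
  Seq(Seq("122abc", "223cde", "334vbn", "445das")),
  Seq(Seq("221bca", "321dsa"), Seq("231dsa", "653asd", "698poq", "897qwa"))
)

// Step 1: fold each partition independently, starting from the empty value.
val perPartition: Seq[Vector[Pair]] =
  fakePartitions.map(partition => partition.foldLeft(Vector.empty[Pair])(foldRow))

// Step 2: merge the per-partition results into one collection.
val combined: Vector[Pair] =
  perPartition.foldLeft(Vector.empty[Pair])(mergeAccs)
```

Because the whole result ends up in a single local collection, this style suits small outputs; for large data a flatMap that keeps the pairs in an RDD is likely the better fit.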
Answer 2 (score: 1)
Convert the input elements into a single Seq, then write a wrapper that gives you List(List(item1,item2), List(item1,item2),...)
Try the code below:
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper that converts a Seq into pairs:
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now pass your seq as the argument, and it will give you pairs of adjacent elements:
toPairs(seqs).mkString(" ")
After converting it to a string, you get output like:
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can transform the string as needed.
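Note that zipping a sequence with its own tail pairs each element with its neighbor, including across array boundaries (for example (445das,221bca)), which is not the key/value pairing the question asks for. A possible variant is sketched below, assuming the rows are kept separate; toKeyedPairs, questionRows, and keyed are hypothetical names, not part of the original answer.

```scala
// Hypothetical helper: pairs the first element of a sequence with each later element.
def toKeyedPairs[A](xs: Seq[A]): Seq[(A, A)] =
  xs match {
    case key +: rest => rest.map(value => key -> value)
    case _           => Seq.empty  // an empty input produces no pairs
  }

// Keep one Seq per row instead of concatenating everything into one Seq.
val questionRows: Seq[Seq[String]] = Seq(
  Seq("122abc", "223cde", "334vbn", "445das"),
  Seq("221bca", "321dsa"),
  Seq("231dsa", "653asd", "698poq", "897qwa")
)

// Apply the helper row by row and flatten the results.
val keyed: Seq[(String, String)] = questionRows.flatMap(row => toKeyedPairs(row))
```

Calling keyed.mkString(" ") then gives a string in the same shape as the output above, but keyed by the first element of each row.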
Answer 3 (score: 1)
Use a DataFrame and explode.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df
  .withColumn("key", 'arr(0))
  .withColumn("values", explode('arr))
  .filter('key =!= 'values)
  .drop('arr)
  .withColumn("tuple", struct('key, 'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update 1:
Using a pair RDD:
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable
import org.apache.spark.rdd.PairRDDFunctions
val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y,x)} )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Result:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)