Splitting records into individual records based on the values of a list in Scala

Time: 2015-11-08 07:30:52

Tags: scala apache-spark

I have an RDD of records like FirstName, LastName, DoB, Age, Email, where Email is a list:

Vikash, Singh, 19-12-1982, 32, {abc@email.com, def@email.com}

I want to split it into two records, like:

Vikash, Singh, 19-12-1982, 32, abc@email.com
Vikash, Singh, 19-12-1982, 32, def@email.com

How can I do this in Scala?

2 answers:

Answer 0 (score: 4):

Assuming your emails are stored in some kind of TraversableOnce, you just need to run a flatMap:

val rdd2 = rdd1.flatMap { case (first, last, dob, age, emails) =>
  for (email <- emails) yield (first, last, dob, age, email)
}

When I run this locally, I get:

scala> val rdd1 = sc.parallelize(Seq(("Vikash", "Singh", "19-12-1982", 32, Seq("abc@email.com", "def@email.com"))))
...
scala> val rdd2 = rdd1.flatMap { case (first, last, dob, age, emails) => for {email <- emails} yield (first, last, dob, age, email) }
...
scala> rdd2.foreach(println)
...
(Vikash,Singh,19-12-1982,32,abc@email.com)
(Vikash,Singh,19-12-1982,32,def@email.com)
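
As an aside not in the original answer: if the same data lives in a DataFrame rather than an RDD of tuples, Spark SQL's explode function produces the same one-row-per-email result. A minimal sketch, assuming a SparkSession named spark (or a SQLContext) with its implicits imported so toDF is available:

import org.apache.spark.sql.functions.{col, explode}

// Assumes `import spark.implicits._` so that toDF works on an RDD of tuples
val df = rdd1.toDF("first", "last", "dob", "age", "emails")

// explode emits one output row per element of the `emails` array column
val exploded = df.withColumn("email", explode(col("emails"))).drop("emails")
exploded.show()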

Answer 1 (score: 1):

Building on @Rohan Aletty's answer, if you want to use map instead of a for comprehension:

val rdd1 = sc.parallelize(Seq(
  ("Vikash", "Singh", "19-12-1982", 32, Seq("abc@email.com", "def@email.com"))
))
val rdd2 = rdd1.flatMap { case (first, last, dob, age, emails) =>
  emails.map(email => (first, last, dob, age, email))
}

println(rdd2.count()) // => 2
rdd2.collect().foreach(println) // => (Vikash,Singh,19-12-1982,32,abc@email.com), (Vikash,Singh,19-12-1982,32,def@email.com)
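
One caveat worth noting (an aside, not from either answer): flatMap emits nothing for a record whose email list is empty, so such records vanish from the output. A small sketch with a hypothetical second record:

val rdd3 = sc.parallelize(Seq(
  ("Vikash", "Singh", "19-12-1982", 32, Seq("abc@email.com", "def@email.com")),
  ("Jane", "Doe", "01-01-1990", 25, Seq.empty[String])  // hypothetical record with no emails
))
val rdd4 = rdd3.flatMap { case (first, last, dob, age, emails) =>
  emails.map(email => (first, last, dob, age, email))
}
println(rdd4.count()) // => 2; the empty-list record is dropped, not emitted with a blank email

If such records should be kept, pad the empty list with a placeholder value before the flatMap.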