SCALA: reading a text file and creating tuples from it

Date: 2017-08-19 06:08:23

Tags: scala apache-spark tuples

How do I create tuples from the following existing RDD?

// reading a text file "b.txt" and creating RDD 
val rdd = sc.textFile("/home/training/desktop/b.txt") 

The b.txt dataset:

 Ankita,26,BigData,newbie
 Shikha,30,Management,Expert

1 Answer:

Answer 0 (score: 5)

If you intend to have an Array[Tuple4], then you can do the following:

 scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
 rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24

 scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
 arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))

Then you can access each field of the tuples:

 scala> arrayTuples.map(x => println(x._3))
 BigData
 Management
 res4: Array[Unit] = Array((), ())

Update

If you have a variable-sized input file, such as:

 Ankita,26,BigData,newbie
 Shikha,30,Management,Expert
 Anita,26,big

then you can write a match-case pattern match as:

 scala> val arrayTuples = rdd.map(line => line.split(",") match {
      |   case Array(a, b, c, d) => (a, b, c, d)
      |   case Array(a, b, c) => (a, b, c)
      | }).collect
 arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
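For readers without a Spark shell at hand, the same pattern match can be tried with plain Scala collections standing in for the RDD. This is an illustrative sketch using the sample rows above, not Spark API code:

```scala
// Sketch: the answer's match-case logic without Spark, using a List
// in place of the RDD. The sample lines are copied from b.txt above.
object MatchSketch extends App {
  val lines = List(
    "Ankita,26,BigData,newbie",
    "Shikha,30,Management,Expert",
    "Anita,26,big"
  )

  // Four-field rows become a Tuple4 and three-field rows a Tuple3,
  // so the common element type is Product with Serializable.
  val arrayTuples = lines.map(line => line.split(",") match {
    case Array(a, b, c, d) => (a, b, c, d)
    case Array(a, b, c)    => (a, b, c)
  })

  arrayTuples.foreach(println)
}
```

Note the match is not exhaustive: a row with some other number of fields would throw a MatchError.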

Updated again

As @eliasah pointed out, the above procedure is bad practice: mixing tuples of different arities forces the result type to Array[Product with Serializable], whose fields can then only be reached through the product iterator. Per his suggestion, we should know the maximum number of elements in the input data and use the following logic, which assigns default values to missing elements:

 import scala.util.Try

 val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect

As @philantrovert pointed out, if we are not using the REPL, we can verify the output in the following way:

 arrayTuples.foreach(println)

which prints each tuple on its own line.
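The Try/getOrElse defaulting can likewise be checked without a Spark session. A minimal sketch with a plain List in place of the RDD (the "Empty" and 0 defaults follow the answer's code):

```scala
import scala.util.Try

// Sketch: the Try/getOrElse defaulting from the answer, applied to a
// plain List instead of an RDD. Accessing a missing index throws
// inside Try, so the field falls back to its default and every row
// yields a full four-element tuple.
object DefaultsSketch extends App {
  val lines = List(
    "Ankita,26,BigData,newbie",
    "Shikha,30,Management,Expert",
    "Anita,26,big" // only three fields: the fourth defaults to "Empty"
  )

  val arrayTuples = lines
    .map(_.split(","))
    .map(array => (
      Try(array(0)).getOrElse("Empty"),
      Try(array(1)).getOrElse(0),
      Try(array(2)).getOrElse("Empty"),
      Try(array(3)).getOrElse("Empty")
    ))

  arrayTuples.foreach(println)
}
```

One caveat with this approach: because the second field defaults to the Int 0 while the parsed value is a String, the tuple's second element is typed Any.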