Reading a file and storing it in an array in Spark

Time: 2017-06-29 06:05:23

Tags: scala apache-spark

I have a colon-delimited file like this:

 2:-31:20063:28:0:1496745908:3879:0:0:0:0:6:4:3
 2:-41:20063:28:0:1496745909:3879:0:0:0:0:6:4:3
 2:-35:20063:28:0:1496745910:3879:0:0:0:0:6:4:3
 2:-44:20063:28:0:1496745911:3879:0:0:0:0:6:4:3
 2:-41:20063:28:0:1496745912:3879:0:0:0:0:6:4:3 
 2:-51:20063:28:0:1496745913:3879:0:0:0:0:6:4:3
 2:-52:20063:28:0:1496745914:3879:0:0:0:0:6:4:3
 2:-61:20063:28:0:1496745915:3879:0:0:0:0:6:4:3

I want to read this file and store it in an array, so that I can access each column for aggregation. I tried this:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setAppName("Proximity Filter")
    .setMaster("local[2]")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
  val input = sc.textFile("/home/arun/Desktop/part-r-00000")
  val wordCount = input.flatMap(line => line.split(":"))
  val input1 = wordCount.take(0)
  System.out.print(input1)
}

1 answer:

Answer 0 (score: 0)

Change your flatMap to map and you should be fine:

val wordCount = input.map(line => line.split(":"))
wordCount.foreach(array => println(array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7), array(8), array(9), array(10), array(11), array(12)))

You should get the output:

( 2,-31,20063,28,0,1496745908,3879,0,0,0,0,6,4)
( 2,-41,20063,28,0,1496745909,3879,0,0,0,0,6,4)
( 2,-35,20063,28,0,1496745910,3879,0,0,0,0,6,4)
( 2,-44,20063,28,0,1496745911,3879,0,0,0,0,6,4)
( 2,-41,20063,28,0,1496745912,3879,0,0,0,0,6,4)
( 2,-51,20063,28,0,1496745913,3879,0,0,0,0,6,4)
( 2,-52,20063,28,0,1496745914,3879,0,0,0,0,6,4)
( 2,-61,20063,28,0,1496745915,3879,0,0,0,0,6,4)
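Since the question mentions aggregating individual columns, here is a minimal plain-Scala sketch of that step once the rows are split; the same logic carries over to the RDD (e.g. `wordCount.map(_(1).toInt)` followed by an action such as `mean()`). The choice of column 1 as the value to average is an assumption for illustration:

```scala
// Plain-Scala sketch of a per-column aggregation; assumes column 1
// (the second field, e.g. -31) is the one to aggregate.
val sample = Seq(
  "2:-31:20063:28:0:1496745908:3879:0:0:0:0:6:4:3",
  "2:-41:20063:28:0:1496745909:3879:0:0:0:0:6:4:3"
)
val rows: Seq[Array[String]] = sample.map(_.split(":"))
val col1 = rows.map(_(1).toInt)              // Seq(-31, -41)
val avgCol1 = col1.sum.toDouble / col1.size  // -36.0
```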

Changing your flatMap to map gives you an RDD of string arrays:

scala> val wordCount = input.map(line => line.split(":"))
wordCount: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

whereas using flatMap gives you an RDD of strings:

scala> val wordCount = input.flatMap(line => line.split(":"))
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26
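The distinction is easy to see on plain Scala collections, which mirror the RDD behaviour (the values here are illustrative, not from the file):

```scala
val raw = Seq("a:b:c", "d:e:f")

// map keeps one element per input line: Seq[Array[String]]
val mapped = raw.map(_.split(":"))

// flatMap flattens all tokens into one sequence: Seq[String]
val flat = raw.flatMap(_.split(":"))
// flat == Seq("a", "b", "c", "d", "e", "f")
```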