I have a delimited file like this:
2:-31:20063:28:0:1496745908:3879:0:0:0:0:6:4:3
2:-41:20063:28:0:1496745909:3879:0:0:0:0:6:4:3
2:-35:20063:28:0:1496745910:3879:0:0:0:0:6:4:3
2:-44:20063:28:0:1496745911:3879:0:0:0:0:6:4:3
2:-41:20063:28:0:1496745912:3879:0:0:0:0:6:4:3
2:-51:20063:28:0:1496745913:3879:0:0:0:0:6:4:3
2:-52:20063:28:0:1496745914:3879:0:0:0:0:6:4:3
2:-61:20063:28:0:1496745915:3879:0:0:0:0:6:4:3
I want to read this file and store it in an array, so that I can access each column for aggregation. I tried this:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("Proximity Filter").setMaster("local[2]").set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
  val input = sc.textFile("/home/arun/Desktop/part-r-00000")
  val wordCount = input.flatMap(line => line.split(":"))
  val input1 = wordCount.take(0)
  System.out.print(input1)
}
Answer 0 (score: 0)
Change your flatMap to map and you should be fine:
val wordCount = input.map(line => line.split(":"))
wordCount.foreach(array => println((array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7), array(8), array(9), array(10), array(11), array(12), array(13))))
You should get output like:
(2,-31,20063,28,0,1496745908,3879,0,0,0,0,6,4,3)
(2,-41,20063,28,0,1496745909,3879,0,0,0,0,6,4,3)
(2,-35,20063,28,0,1496745910,3879,0,0,0,0,6,4,3)
(2,-44,20063,28,0,1496745911,3879,0,0,0,0,6,4,3)
(2,-41,20063,28,0,1496745912,3879,0,0,0,0,6,4,3)
(2,-51,20063,28,0,1496745913,3879,0,0,0,0,6,4,3)
(2,-52,20063,28,0,1496745914,3879,0,0,0,0,6,4,3)
(2,-61,20063,28,0,1496745915,3879,0,0,0,0,6,4,3)
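Once each line is an Array[String], per-column aggregation is straightforward. Here is a minimal sketch reusing the file path from the question; the column indices and the toDouble parsing are illustrative assumptions, since the question does not say which columns to aggregate:

// Split each line into its 14 colon-separated fields.
val rows = sc.textFile("/home/arun/Desktop/part-r-00000").map(line => line.split(":"))

// Average the second column (index 1), parsed as Double.
// Assumes every row has a numeric value at this index.
val col1 = rows.map(array => array(1).toDouble)
val avg = col1.sum() / col1.count()
println(s"average of column 1: $avg")

// Sum the seventh column (index 6) the same way.
val col6Sum = rows.map(array => array(6).toDouble).sum()
println(s"sum of column 6: $col6Sum")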
Changing your flatMap to map produces an RDD of string arrays:
scala> val wordCount = input.map(line => line.split(":"))
wordCount: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
whereas flatMap gives you an RDD of individual strings:
scala> val wordCount = input.flatMap(line => line.split(":"))
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26
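To see the difference concretely, count the elements of each RDD. A quick sanity check, assuming the 8-line sample file above, where each line has 14 fields:

val perLine = input.map(line => line.split(":"))        // RDD[Array[String]]
val flattened = input.flatMap(line => line.split(":"))  // RDD[String]

println(perLine.count())   // 8: one array per input line
println(flattened.count()) // 112: 8 lines * 14 fields, flattened into one RDD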