I am new to Scala, and I want to iterate three loops over a dataset and perform some analysis. For example, my data is as below:
Sample.csv
1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5
So for now I have read the data as follows:
val data = sc.textFile("C:/users/ricky/Data.csv")
Now I need to implement a filter on the first three columns to take subsets of the whole data in Scala and do some analysis on each subset. The first three columns are the ones to filter on: I have 1 value for the first column (1), 3 values for the second column (100, 101, 102), and 3 values for the third column (0, 1, 2). So I need to run the filter over all combinations, using a loop like the one below:
for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
}
This should give subsets of the data such as
1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
where i=1, j=100, and k=0,
and
1,102,2,NA,0,1,0,Friday,1,5
where i=1, j=102, and k=2.
How can I run this over the data (which I read from the CSV) in Scala?
Answer 0 (score: 2)
After reading from the csv text file, you can use filter to get the data you need:
// split each csv line into an Array[String] of its fields
val tempData = data.map(line => line.split(","))
// keep only the rows whose first three columns are 1, 100 and 0
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0").foreach(x => println(x.mkString(",")))
This will give you the result:
1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
You can do the same for the other combinations.
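If you actually want to loop over every (i, j, k) combination as in the question, one way is to nest the filter inside the for-comprehension. Below is a minimal sketch, reusing the tempData RDD from above (the analysis step is just a placeholder):

for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
} {
  // filter out the subset for this (i, j, k) combination
  val subset = tempData.filter(a => a(0) == i.toString && a(1) == j.toString && a(2) == k.toString)
  println(s"subset for i=$i, j=$j, k=$k:")
  subset.foreach(x => println(x.mkString(",")))
  // run your analysis on subset here
}

Note that each combination triggers a separate pass over the RDD; if there are many combinations, grouping once by the three key columns is usually cheaper.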
Dataframe APIs
You can use the dataframe api, which is simpler and better optimized than rdd. The first step is to read the csv as a dataframe:
val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")
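As a side note, on Spark 2.x the csv reader is built in, so (assuming a SparkSession named spark, which the snippet above does not define) the same read could be written as:

val df = spark.read.csv("path to csv file")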
Either way, you would get
+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7 |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1 |100|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|0 |NA |0 |1 |0 |Wednesday|1 |9 |
|1 |100|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|2 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |101|2 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|1 |NA |0 |1 |0 |Friday |1 |5 |
|1 |102|2 |NA |0 |1 |0 |Friday |1 |5 |
+---+---+---+---+---+---+---+---------+---+---+
Then you can use the filter api just as with rdd:
import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")
and you should get
+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7 |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1 |100|0 |NA |0 |1 |0 |Friday |1 |5 |
|1 |100|0 |NA |0 |1 |0 |Wednesday|1 |9 |
+---+---+---+---+---+---+---+---------+---+---+
You can even define a schema to get the column names you want.
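A minimal sketch of that, with the databricks reader used above (the column names here are made up for illustration; use whatever fits your data):

import org.apache.spark.sql.types._

// hypothetical names and types for the ten columns
val schema = StructType(Seq(
  StructField("flag", IntegerType),
  StructField("group", IntegerType),
  StructField("index", IntegerType),
  StructField("col4", StringType),
  StructField("col5", IntegerType),
  StructField("col6", IntegerType),
  StructField("col7", IntegerType),
  StructField("day", StringType),
  StructField("col9", IntegerType),
  StructField("col10", IntegerType)
))

val namedDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to csv file")

// the filter can now use readable column names
val df2 = namedDf.filter($"flag" === 1 && $"group" === 100 && $"index" === 0)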
Edited
To answer your comment below: it all depends on what you want as output. Mapping each row back with mkString gives you an RDD[String], while the filter alone gives you an RDD[Array[String]]:
scala> val temp = tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0").map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28
scala> tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0")
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29
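If you then want to bring the String form back to the driver and print it, a small usage sketch:

temp.collect().foreach(println)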
I hope this is clear.