How to filter data based on three columns in Scala

Asked: 2017-07-12 01:07:39

Tags: scala hadoop apache-spark

I am new to Scala, and I want to iterate over three loops on a dataset and perform some analysis. For example, my data looks like this:

Sample.csv

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5

So far I have read the data as follows:

val data = sc.textFile("C:/users/ricky/Data.csv")

Now I need to implement a filter on the first three columns in Scala that pulls out subsets of the whole data and runs some analysis on them. The first three columns are the ones to filter on: I have one value for the first column (1), three values for the second column (100, 101, 102), and three values for the third column (0, 1, 2). So I need to run the filter for each combination, using a loop like the one below:

for {
  i <- 1
  j <- 100 to 102
  k <- 1 to 2
}

This should give me subsets of the data, such as

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

where i=1, j=100, and k=0

and

1,102,2,NA,0,1,0,Friday,1,5

where i=1, j=102, and k=2

How can I run this over the data (which I read from the CSV) in Scala?

1 Answer:

Answer 0 (score: 2)

After reading from the csv text file, you can use filter to keep only the rows you need:

// split each line into columns, then keep rows where the first three columns are 1, 100, 0
val tempData = data.map(line => line.split(","))
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0")
  .foreach(x => println(x.mkString(",")))

This will give you the result:

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

You can do the same for the other cases.
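
If you want to loop over all the value combinations from the question rather than hard-coding one filter, a minimal sketch (reusing the tempData RDD defined above, and assuming the value ranges 1, 100-102, 0-2 given in the question) could look like this:

for {
  i <- Seq(1)       // first column values
  j <- 100 to 102   // second column values
  k <- 0 to 2       // third column values
} {
  val subset = tempData.filter(array =>
    array(0) == i.toString && array(1) == j.toString && array(2) == k.toString)
  // run whatever analysis you need on the subset; here we just count it
  println(s"i=$i, j=$j, k=$k -> ${subset.count()} rows")
}

Note that this triggers one Spark job per combination; if the number of combinations grows, keying the RDD by the first three columns and aggregating per key is usually cheaper.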

DataFrame APIs

You can use the DataFrame API for simplicity, better optimization than RDDs, and so on. The first step is to read the csv as a dataframe:

val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")

and you will have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
|1  |100|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|2  |NA |0  |1  |0  |Friday   |1  |5  |
+---+---+---+---+---+---+---+---------+---+---+

Then you can use the filter API just as with the rdd:

import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")

and you should get

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
+---+---+---+---+---+---+---+---------+---+---+

You can even define a schema to get the column names you want, as sketched below.
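
Here is a minimal sketch of that, assuming Spark 2.x (where the csv reader is built in) with a SparkSession named spark, and with column names invented here purely for illustration:

import org.apache.spark.sql.types._
import spark.implicits._

// hypothetical column names; only the first three matter for the filter
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("group", IntegerType),
  StructField("seq", IntegerType),
  StructField("flag", StringType),
  StructField("c4", IntegerType),
  StructField("c5", IntegerType),
  StructField("c6", IntegerType),
  StructField("day", StringType),
  StructField("c8", IntegerType),
  StructField("c9", IntegerType)
))

val df = spark.read.schema(schema).csv("path to csv file")

// the filter can now use meaningful column names and typed values
val df1 = df.filter($"id" === 1 && $"group" === 100 && $"seq" === 0)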

Edited

To answer your comment below: it entirely depends on what output you want.

scala> val temp = tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0).map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28

scala> tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29
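
For example, if you wanted to persist each of them (hypothetical output paths, just a sketch), the String RDD can be saved directly, while the Array[String] RDD needs to be joined back into lines first:

// `temp` is RDD[String]: each element is already a comma-joined line
temp.saveAsTextFile("output/filtered-lines")   // hypothetical path

// the plain filter result is RDD[Array[String]]: join each array before saving
tempData
  .filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
  .map(_.mkString(","))
  .saveAsTextFile("output/filtered-arrays")    // hypothetical path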

I hope this is clear.