Spark map-reduce with a condition

Asked: 2016-04-04 15:08:51

Tags: scala apache-spark

Suppose this is my CSV file:

attr1;attr2
11111;MOC
22222;MTC
11111;MOC
22222;MOC
33333;MMS

I want the number of occurrences of each value in the first column, counting only rows where attr2 = MOC. Like this:

(11111,2)
(22222,1)

I tried:

val sc = new SparkContext(conf)
val textFile = sc.textFile(args(0))

val data = textFile.map(line => line.split(";").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0))

val rows = data.filter(line => header(line,"attr1") != "attr1")
val attr1 = rows.map(row => header(row,"attr1"))
val attr2 = rows.map(row => header(row,"attr2"))
// reduceByKey returns a new RDD, so keep the result and print that
val counts = attr1.map(k => (k, 1)).reduceByKey(_ + _)

counts.foreach(println)

How can I add the condition to my code? The result of my code is:

(11111,2)
(22222,2)
(33333,1)

1 answer:

Answer 0 (score: 0)

Use filter (again):

val rows = data
  .filter(line => header(line,"attr1") != "attr1")
  .filter(line => header(line,"attr2") == "MOC")

Then continue as before: derive attr1 from the filtered rows and count with reduceByKey.
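
For reference, the whole pipeline might look like the sketch below. It is only an illustration under a few assumptions: the question does not show SimpleCSVHeader, so columns are resolved here through a plain name-to-index map built from the header row; the names MocCount and countMoc are hypothetical; and the sample rows are inlined with sc.parallelize on a local master instead of being read with sc.textFile(args(0)).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object and method names, used only for this sketch.
object MocCount {

  // Counts attr1 values over the rows whose attr2 equals "MOC".
  def countMoc(lines: RDD[String]): RDD[(String, Int)] = {
    val data = lines.map(_.split(";").map(_.trim))

    // Stand-in for the question's SimpleCSVHeader (an assumption):
    // map each column name in the header row to its position.
    val index = data.first().zipWithIndex.toMap
    val i1 = index("attr1")
    val i2 = index("attr2")

    data
      .filter(row => row(i1) != "attr1") // drop the header row
      .filter(row => row(i2) == "MOC")   // the extra condition
      .map(row => (row(i1), 1))
      .reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Local master for illustration only.
    val conf = new SparkConf().setAppName("MocCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample data copied from the question.
      val lines = sc.parallelize(Seq(
        "attr1;attr2",
        "11111;MOC",
        "22222;MTC",
        "11111;MOC",
        "22222;MOC",
        "33333;MMS"
      ))
      // Collect to the driver and sort so the printed order is stable.
      countMoc(lines).collect().sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data this should print (11111,2) and (22222,1). Filtering before the map means rows with other attr2 values never reach the shuffle triggered by reduceByKey.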