I'm new to Spark and I'm following some of the basic examples in the documentation.
I have a CSV file like this (simplified; the real one has nearly 40,000 rows):
date,category
19900108,apples
19900108,apples
19900308,peaches
19900408,peaches
19900508,pears
19910108,pears
19910108,peaches
19910308,apples
19910408,apples
19910508,apples
19920108,pears
19920108,peaches
19920308,apples
19920408,peaches
19920508,pears
This Scala code works for computing the per-category totals:
val textFile = sc.textFile("sample.csv")
textFile.filter(line => line.contains("1990")).filter(line =>line.contains("peaches")).count()
textFile.filter(line => line.contains("1990")).filter(line => line.contains("apples")).count()
textFile.filter(line => line.contains("1990")).filter(line => line.contains("pears")).count()
What is the best way to loop through each row, accumulating category totals by year, so that I end up writing a CSV file like this:
date,apples,peaches,pears
1990,2,2,1
1991,3,1,1
1992,1,2,2
Any help would be greatly appreciated.
Answer 0 (score: 1):
// Create the Spark SQL context
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// Read the CSV with the spark-csv package, inferring column types
var df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("sample.csv")

// Extract the year as the first four characters of the date (substr is 1-based)
df = df.withColumn("year", df.col("date").substr(1, 4))

// Pivot the categories into columns, counting rows per (year, category)
df = df.groupBy("year").pivot("category").agg("category" -> "count")

// Add a per-year total across the three category columns
df.withColumn("total", df.col("apples") + df.col("peaches") + df.col("pears")).show()
// Required Maven dependency (the spark-csv package):
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
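For reference, the spark-csv package is only needed on Spark 1.x. On Spark 2.x and later, CSV support is built in, so the read above would look roughly like this (spark is the SparkSession provided by spark-shell):

// Spark 2.x+: built-in CSV reader, no external dependency required
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sample.csv")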