Question

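I am learning how to use Apache Spark, and I am trying to get the average temperature for each hour from a dataset. The dataset I am using comes from weather information stored in a CSV file. I can't figure out how to first read the CSV file and then calculate the average temperature for each hour.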

From the Spark documentation, I am using the example Scala line to read a file.

val textFile = sc.textFile("README.md")

I have given a link to the data files below. I am using the file called JCMB_2014.csv, as it is the latest one with all months covered.

Weather Data

Edit: The code I have tried so far is:

class SimpleCSVHeader(header:Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array:Array[String], key:String):String = array(index(key))
}

val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header

// drop the header row itself
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time"))
// temperatures are decimal values, so parse with toDouble rather than toInt
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toDouble)

Answer 1

Here is sample code to calculate the averages by hour.

Step 1: Read the file, filter out the header, and extract the time and temperature columns

scala> val lines = sc.textFile("JCMB_2014.csv")
scala> val hourlyTemps = lines.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))

(time / 60) discards the minutes and keeps only the hour.
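To make the bucketing concrete, here is a minimal sketch in plain Scala (no Spark needed). It assumes the time column holds an integer count of minutes, which is what dividing by 60 implies; the object name `HourBucket` and the sample values are made up for illustration. If the column were encoded differently (for example as HHMM), the divisor would have to change accordingly.

```scala
object HourBucket {
  // Assumption: `minutes` is an integer count of minutes (e.g. 545 = 09:05).
  // Integer division truncates, so every minute within the same hour
  // maps to the same key.
  def hourOf(minutes: Int): Int = minutes / 60

  def main(args: Array[String]): Unit = {
    // Illustrative values only, not taken from the data file.
    Seq(0, 59, 60, 545, 599).foreach { m =>
      println(s"$m minutes -> hour ${hourOf(m)}")
    }
  }
}
```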

Step 2: Aggregate the temperatures and the number of occurrences

scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggregateTemps.take(1)
res26: Array[(Int, (Double, Int))] = Array((34,(8565.25,620)))
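The reason for carrying a (sum, count) pair instead of averaging directly is that addition can be applied pairwise in any order, which is what reduceByKey needs, whereas an average of averages is generally not the overall average. The sketch below imitates the step on a local collection; the `reduceByKey` helper here is a hypothetical stand-in written for this example, not Spark's method, and the readings are made up.

```scala
object SumCountSketch {
  // Local stand-in for Spark's reduceByKey: merge all values sharing a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(merge) }

  def main(args: Array[String]): Unit = {
    // Made-up (hour, (temperature, 1)) pairs, shaped like hourlyTemps.
    val hourlyTemps = Seq((9, (10.0f, 1)), (9, (12.0f, 1)), (10, (11.5f, 1)))

    // Same merge function as in the answer: add sums, add counts.
    val aggregated =
      reduceByKey(hourlyTemps)((a, b) => (a._1 + b._1, a._2 + b._2))

    aggregated.toSeq.sortBy(_._1).foreach(println)
  }
}
```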

Step 3: Compute the average from the total temperature and the number of occurrences. The final result is shown below.

scala> val avgTemps=aggregateTemps.map(tuple=>(tuple._1,tuple._2._1/tuple._2._2))
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))
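For reference, the three steps can be combined into one standalone program. This is only a sketch under assumptions: it keeps the column positions used above (3 for time, 8 for temperature), runs with a local master, and needs Spark on the classpath. The names `HourlyAverageRdd` and `parseLine` are made up for this example. Unlike the filter above, it silently drops any row that fails to parse, which also removes the header. Because collect returns keys in no particular order, the sketch adds a sortByKey before collecting.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object HourlyAverageRdd {
  // Column positions follow the answer above (3 = time, 8 = temperature).
  val TimeCol = 3
  val TempCol = 8

  // Returns None for the header row or any row that does not parse.
  def parseLine(line: String): Option[(Int, (Float, Int))] = {
    val entries = line.split(",").map(_.trim)
    Try((entries(TimeCol).toInt / 60, (entries(TempCol).toFloat, 1))).toOption
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hourly-average").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val avgTemps = sc.textFile("JCMB_2014.csv")
        .flatMap(line => parseLine(line).toList)            // step 1: parse, skip bad rows
        .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // step 2: sum and count
        .mapValues { case (sum, count) => sum / count }     // step 3: average
        .sortByKey()                                        // order results by hour key
      avgTemps.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```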

Answer 2

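You may want to provide a schema definition for the CSV file and convert the RDD to a DataFrame, as described in the documentation. DataFrames provide a whole set of useful predefined statistical functions, as well as the ability to write some simple custom functions. You will then be able to compute the average: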

dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>))
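A possible concrete version of that line is sketched below. It assumes Spark 2.x or later (for SparkSession) and uses a small in-memory DataFrame with hypothetical column names "minutes" and "temperature" rather than the real file, whose layout is not shown here. The names `HourlyAverageDf` and `hourlyAverage` are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, floor}

object HourlyAverageDf {
  // Assumes columns "minutes" (integer minutes) and "temperature" (numeric).
  def hourlyAverage(readings: DataFrame): DataFrame =
    readings
      .withColumn("hour", floor(col("minutes") / 60)) // derive the grouping key
      .groupBy("hour")
      .agg(avg("temperature").as("avg_temperature"))
      .orderBy("hour")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-average-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Made-up rows for illustration only.
      val readings = Seq((545, 10.0), (550, 12.0), (600, 11.5))
        .toDF("minutes", "temperature")
      hourlyAverage(readings).show()
    } finally {
      spark.stop()
    }
  }
}
```

For the actual file, something like spark.read.option("header", "true").option("inferSchema", "true").csv("JCMB_2014.csv") should give a DataFrame to start from; the column names would then be those in the file's header (the question mentions "date-time" and "surface temperature (C)"), and the hour would have to be derived from whatever format that time column uses.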
