Question

I have a table similar to the following:

    +----------+----+--------------+-------------+
    |      Date|Hour|       Weather|Precipitation|
    +----------+----+--------------+-------------+
    |2013-07-01|   0|          null|         null|
    |2013-07-01|   3|          null|         null|
    |2013-07-01|   6|         clear|trace of p...|
    |2013-07-01|   9|          null|         null|
    |2013-07-01|  12|          null|         null|
    |2013-07-01|  15|          null|         null|
    |2013-07-01|  18|          rain|         null|
    |2013-07-01|  21|          null|         null|
    |2013-07-02|   0|          null|         null|
    |2013-07-02|   3|          null|         null|
    |2013-07-02|   6|          rain|low precip...|
    |2013-07-02|   9|          null|         null|
    |2013-07-02|  12|          null|         null|
    |2013-07-02|  15|          null|         null|
    |2013-07-02|  18|          null|         null|
    |2013-07-02|  21|          null|         null|
    +----------+----+--------------+-------------+

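The idea is to fill the Weather and Precipitation columns with the values recorded at hours 6 and 18 (for Weather) and at hour 6 (for Precipitation), respectively. Since this table illustrates a DataFrame structure, simply iterating over it seems unreasonable. I tried something like this: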

    // _weather stands for the table shown above
    def fillEmptyCells: Unit = {
      val hourIndex = _weather.schema.fieldIndex("Hour")
      val dateIndex = _weather.schema.fieldIndex("Date")
      val weatherIndex = _weather.schema.fieldIndex("Weather")
      val precipitationIndex = _weather.schema.fieldIndex("Precipitation")

      val days = _weather.select("Date").distinct().rdd
      days.foreach(x => {
        // the s interpolator is required for ${...} to be substituted
        val day = _weather.where(s"Date == '${x(0)}'")
        val dayValues = day.where("Hour == 6").first()
        val weather = dayValues.getString(weatherIndex)
        val precipitation = dayValues.getString(precipitationIndex)
        // the mapped RDD is discarded here; the pieces are never combined
        day.rdd.map(y => (y(dateIndex), y(hourIndex), weather, precipitation))
      })
    }

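However, this ugly code smells, because it iterates over an RDD instead of processing the data in a distributed manner. It would also have to assemble a new RDD or DataFrame from the pieces, which could be a problem (I have no idea how to do that). Is there a more elegant and simpler way to solve this task?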

Answer 1

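Assuming you can easily create a timestamp column by combining Date and Hour, here is what I would do next:

- Convert this timestamp (probably in milliseconds or seconds) to an hour timestamp, something like .withColumn("hourTimestamp", floor($"timestamp" / 3600)) if it is in seconds?
- Create 3 columns corresponding to the different possible hour lags (3, 6, 9).
- coalesce these three columns plus the original one.

Do this for Weather, then do the same for Precipitation.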
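A minimal sketch of these steps in Scala, under several assumptions: the helper names withHourTimestamp and fillFromLags are hypothetical, the timestamp is taken to be in seconds, and the lags are expressed as row offsets 1 to 3, which equals 3, 6 and 9 hours only if every 3-hour slot has a row. The sample rows reuse the table from the question, including its truncated Precipitation strings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, floor, lag, unix_timestamp}

object FillFromLags {
  // Hypothetical helper: builds a timestamp in seconds from Date and Hour,
  // then reduces it to whole hours.
  def withHourTimestamp(df: DataFrame): DataFrame =
    df.withColumn("timestamp",
        unix_timestamp(col("Date"), "yyyy-MM-dd") + col("Hour") * 3600)
      .withColumn("hourTimestamp", floor(col("timestamp") / 3600))

  // Hypothetical helper: replaces nulls in `column` with the value found
  // 1, 2 or 3 rows earlier (3, 6 or 9 hours earlier for 3-hourly rows).
  def fillFromLags(df: DataFrame, column: String): DataFrame = {
    // No partitionBy: Spark moves all rows into one partition for this window.
    val w = Window.orderBy("hourTimestamp")
    val lagged = (1 to 3).map(k => lag(col(column), k).over(w))
    // coalesce returns the first non-null among the original value and the lags.
    df.withColumn(column, coalesce((col(column) +: lagged): _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("fill-from-lags").getOrCreate()
    import spark.implicits._

    // Non-null cells of the question's table; every other cell is null.
    val known: Map[(String, Int), (Option[String], Option[String])] = Map(
      ("2013-07-01", 6)  -> (Some("clear"), Some("trace of p...")),
      ("2013-07-01", 18) -> (Some("rain"), None),
      ("2013-07-02", 6)  -> (Some("rain"), Some("low precip..."))
    )
    val rows = for {
      d <- Seq("2013-07-01", "2013-07-02")
      h <- 0 to 21 by 3
    } yield {
      val (w, p) = known.getOrElse((d, h), (None, None))
      (d, h, w, p)
    }
    val weather = rows.toDF("Date", "Hour", "Weather", "Precipitation")

    // Apply the same fill to both columns, then drop the helper columns.
    val filled = Seq("Weather", "Precipitation")
      .foldLeft(withHourTimestamp(weather))(fillFromLags)
      .drop("timestamp", "hourTimestamp")

    filled.orderBy("Date", "Hour").show()
    spark.stop()
  }
}
```

Because the lags read the original column, a value is carried forward at most 9 hours. Rows with nothing non-null in that range stay null, for example hours 0 and 3 on 2013-07-01, so a longer look-back or a different technique would be needed to fill every row of a day.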
