Adding a new record before another one in Spark

Asked: 2019-03-04 12:21:32

Tags: scala apache-spark

I have a DataFrame:

| ID | TIMESTAMP | VALUE |
|----|-----------|-------|
| 1  | 15:00:01  | 3     |
| 1  | 17:04:02  | 2     |

Using Spark with Scala, I want to insert a new record just before each row whose value is 2, with the timestamp reduced by 1 second.

The output should be:

| ID | TIMESTAMP | VALUE |
|----|-----------|-------|
| 1  | 15:00:01  | 3     |
| 1  | 17:04:01  | 2     |
| 1  | 17:04:02  | 2     |

Thanks

2 Answers:

Answer 0 (score: 0)

You can introduce a new array column: when value = 2 it holds Array(-1, 0), otherwise Array(0). Then explode that column and add the exploded offset (in seconds) to the timestamp. The following should work for you; check it out:

scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]

scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]

scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp          |value|
+---+-------------------+-----+
|1  |2019-03-04 15:00:01|3    |
|1  |2019-03-04 17:04:02|2    |
+---+-------------------+-----+


scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]

scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp          |value|newc   |
+---+-------------------+-----+-------+
|1  |2019-03-04 15:00:01|3    |[0]    |
|1  |2019-03-04 17:04:02|2    |[-1, 0]|
+---+-------------------+-----+-------+


scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]

scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2         |value|
+---+-------------------+-----+
|1  |2019-03-04 15:00:01|3    |
|1  |2019-03-04 17:04:01|2    |
|1  |2019-03-04 17:04:02|2    |
+---+-------------------+-----+


If you only want the time (without the date), you can do:

scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1  |15:00:01 |3    |
|1  |17:04:01 |2    |
|1  |17:04:02 |2    |
+---+---------+-----+
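
One caveat, not from the original answer: applying lit to a Scala Array is only accepted by newer Spark releases. If it fails on your version, the same "newc" column can be built with the array function (or with typedLit, available since Spark 2.2):

val df3 = df2.withColumn("newc",
  when($"value" === 2, array(lit(-1), lit(0)))  // value = 2: shifted copy plus original
    .otherwise(array(lit(0))))                  // otherwise: no shift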

Answer 1 (score: 0)

You need a .flatMap(). As the Spark docs describe it:

> Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
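
As a quick stand-alone illustration of that contract (plain Scala collections, not part of the original answer): map is always one-to-one, while flatMap lets one input expand into several outputs.

// One input element (the 2) becomes two output elements.
Seq(1, 2, 3).flatMap(n => if (n == 2) Seq(n - 1, n) else Seq(n))
// => List(1, 1, 2, 3)

Applied to the DataFrame, typed as a Dataset of tuples: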

val data = (spark.createDataset(Seq(
    (1, "15:00:01", 3),
    (1, "17:04:02", 2)
  )).toDF("ID", "TIMESTAMP_STR", "VALUE")
  // Parse the time string into a real timestamp column.
  .withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp"))
  .drop("TIMESTAMP_STR")
  .select("ID", "TIMESTAMP", "VALUE")
)

// For every row with VALUE == 2, emit a copy shifted back by one second
// followed by the original row; all other rows pass through unchanged.
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
  if (r._3 == 2) {
    Seq(
      (r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
      (r._1, r._2, r._3)
    )
  } else {
    Seq((r._1, r._2, r._3))
  }
}).toDF("ID", "TIMESTAMP", "VALUE").show()

This results in:

+---+-------------------+-----+
| ID|           TIMESTAMP|VALUE|
+---+-------------------+-----+
|  1|2019-03-04 15:00:01|    3|
|  1|2019-03-04 17:04:01|    2|
|  1|2019-03-04 17:04:02|    2|
+---+-------------------+-----+
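
A closing note that applies to both answers (my own assumption, not something either answer states): Spark does not guarantee row ordering in a DataFrame, so if downstream logic depends on the inserted record appearing before the original one, sort explicitly:

// "result" is a placeholder name for the final DataFrame from either answer.
result.orderBy($"ID", $"TIMESTAMP").show(false)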