How to create node pairs in Spark?

Date: 2018-03-11 21:52:01

Tags: scala apache-spark spark-dataframe

I have the following DataFrame in Spark with Scala:

group   nodeId   date
1       1        2016-10-12T12:10:00.000Z
1       2        2016-10-12T12:00:00.000Z
1       3        2016-10-12T12:05:00.000Z
2       1        2016-10-12T12:30:00.000Z
2       2        2016-10-12T12:35:00.000Z

I need to group the records by group, sort them by date in ascending order, and generate pairs of sequential nodeIds. Additionally, date should be converted to Unix epoch.

This is better explained with the expected output:

group   nodeId_1   nodeId_2   date
1       2          3          2016-10-12T12:00:00.000Z
1       3          1          2016-10-12T12:05:00.000Z
2       1          2          2016-10-12T12:30:00.000Z

Here is what I have done so far:

df
  .groupBy("group")
  .agg($"nodeId",$"date")
  .orderBy(asc("date"))

But I don't know how to create the nodeId pairs.

2 Answers:

Answer 0 (score: 1)

You can benefit from a Window function with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date to a timestamp. Finally, you have to filter out the unpaired rows, since you don't need them in the output.
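For completeness, here is a minimal sketch of building the sample DataFrame from the question (this assumes a SparkSession is in scope as spark; the code below refers to the frame as df):

import spark.implicits._

// sample data from the question; dates are ISO-8601 strings
val df = Seq(
  (1, 1, "2016-10-12T12:10:00.000Z"),
  (1, 2, "2016-10-12T12:00:00.000Z"),
  (1, 3, "2016-10-12T12:05:00.000Z"),
  (2, 1, "2016-10-12T12:30:00.000Z"),
  (2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")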

Here is the program for the explanation above; I have used comments for clarity:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("group").orderBy("date")  // window partitioned by group and ordered by date

df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu"))  // interpreting the date as Asia/Kathmandu time and rendering it as a UTC timestamp; choose another timezone as required
  .withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec))           // pairing each nodeId with the next one in the window
  .filter(col("nodeId_2").isNotNull)                                    // filtering out the unpaired last row of each group
  .select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date"))  // selecting the final columns
  .show(false)

You should get the final dataframe as required:

+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date               |
+-----+--------+--------+-------------------+
|1    |2       |3       |2016-10-12 12:00:00|
|1    |3       |1       |2016-10-12 12:05:00|
|2    |1       |2       |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+

I hope the answer is helpful.

Note: To get the correct epoch date I used Asia/Kathmandu as the timezone.
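Note also that to_utc_timestamp produces a timestamp column rather than a number. If you need a numeric Unix epoch (seconds since 1970-01-01 UTC), a minimal sketch using the unix_timestamp built-in instead; the format pattern here is an assumption and may need adjusting for your Spark version's date parser:

import org.apache.spark.sql.functions._

// parse the ISO-8601 date string and convert it to seconds since the Unix epoch
val withEpoch = df.withColumn("epoch", unix_timestamp(col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSSX"))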

Answer 1 (score: 0)

If I understand your requirement correctly, you can use a self-join on group with a < inequality condition on nodeId:

val df = Seq(
  (1, 1, "2016-10-12T12:10:00.000Z"),
  (1, 2, "2016-10-12T12:00:00.000Z"),
  (1, 3, "2016-10-12T12:05:00.000Z"),
  (2, 1, "2016-10-12T12:30:00.000Z"),
  (2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")

df.as("df1").join(
  df.as("df2"),
  $"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
  $"df1.group",
  $"df1.nodeId",
  $"df2.nodeId",
  when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date                    |
// +-----+------+------+------------------------+
// |1    |1     |3     |2016-10-12T12:05:00.000Z|
// |1    |1     |2     |2016-10-12T12:00:00.000Z|
// |1    |2     |3     |2016-10-12T12:00:00.000Z|
// |2    |1     |2     |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+
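Unlike the window approach in the first answer, this self-join enumerates every node pair within a group (the < condition guarantees each unordered pair appears once), and the when/otherwise expression keeps the earlier of the two dates, which is why the rows differ from the expected output in the question.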
