Correctly implementing the lag function in Spark SQL

Asked: 2019-04-29 05:59:04

Tags: java apache-spark

I wrote a Spark job (job-1) that takes a CSV file as input, partitions the records by user ID, and coalesces them into 100 snappy-compressed Parquet files. Now I am working on a second job (job-2) that takes job-1's output as its input and performs some computations.
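For reference, job-1 boils down to something like the following sketch (the paths and the CSV options are placeholders, not my exact code; `idvalue` is the user-ID column used throughout):

    // Sketch of job-1, assuming org.apache.spark.sql.functions is imported.
    // Read the CSV, hash-partition by user ID into 100 partitions, and
    // write snappy-compressed Parquet (paths are placeholders).
    Dataset<Row> csv = spark.read()
            .option("header", "true")
            .csv("s3a://bucket/input/");

    csv.repartition(100, functions.col("idvalue"))
            .write()
            .option("compression", "snappy")
            .parquet("s3a://bucket/job1-output/");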

Job-2's input has two columns I am interested in, latitude and longitude, along with a user ID and a timestamp. I want to create two new columns that shift the latitude and longitude by one row, which I need in order to compute distances with the Haversine formula. I am using the Java DataFrame API against Spark 2.4.0. The snippet looks like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;

Dataset<Row> ds1 = spark.read().parquet(path);
WindowSpec window = Window.partitionBy("idvalue").orderBy("timestamp");

// na().drop() returns a new Dataset; the result has to be reassigned,
// otherwise rows with null timestamps are silently kept.
ds1 = ds1.na().drop(new String[] { "timestamp" });

Dataset<Row> ds2 = ds1.withColumn("hour", functions.hour(ds1.col("timestamp")))
      .withColumn("day", functions.dayofmonth(ds1.col("timestamp")))
      .withColumn("date", functions.to_date(ds1.col("timestamp")))
      .withColumn("locationlat-shifted", functions.lag(ds1.col("locationlat"), 1).over(window))
      .withColumn("locationlon-shifted", functions.lag(ds1.col("locationlon"), 1).over(window));
ds2.show(20);
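For context, the distance step I have in mind would build on the shifted columns roughly like this (a sketch of the Haversine formula expressed with Spark column functions; the 6371 km earth radius and the `distance-km` column name are my own choices):

    // Needs: import org.apache.spark.sql.Column;
    // Haversine distance in km between each point and the previous one.
    // The first row per user has null shifted columns, so it yields null.
    Column lat1 = functions.radians(ds2.col("locationlat-shifted"));
    Column lat2 = functions.radians(ds2.col("locationlat"));
    Column dLat = lat2.minus(lat1);
    Column dLon = functions.radians(ds2.col("locationlon"))
            .minus(functions.radians(ds2.col("locationlon-shifted")));

    Column a = functions.pow(functions.sin(dLat.divide(2)), 2)
            .plus(functions.cos(lat1).multiply(functions.cos(lat2))
                    .multiply(functions.pow(functions.sin(dLon.divide(2)), 2)));

    Dataset<Row> ds3 = ds2.withColumn("distance-km",
            functions.asin(functions.sqrt(a)).multiply(2 * 6371.0));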

Am I doing this right? I ask because showing the output takes a long time (about 10 minutes) on roughly 2.5 GB of input data. I am not sure whether I am taking advantage of the fact that the data is already grouped by user ID in the input Parquet files. On standard output I see the following two warnings:

WARN HttpMethodReleaseInputStream: Attempting to release HttpMethod in finalize() as its response data stream has gone out of scope. This attempt will not always succeed and cannot be relied upon! Please ensure response data streams are always fully consumed or closed to avoid HTTP connection starvation.
WARN HttpMethodReleaseInputStream: Successfully released HttpMethod in finalize(). You were lucky this time... Please ensure response data streams are always fully consumed or closed.

When creating the WindowSpec I am partitioning by user ID, but I am not sure whether that is the right approach. Any insight would be appreciated, as I am not very familiar with window operations in Spark SQL.
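One thing I considered (untested, so treat it as an assumption): writing plain Parquet files in job-1 records no grouping metadata, so the window in job-2 still triggers a full shuffle on `idvalue`. If job-1 wrote a bucketed table instead, the planner might be able to drop that exchange, something like:

    // In job-1 (csv is the Dataset read from the input files).
    // bucketBy() requires saveAsTable() in Spark 2.4;
    // "events" is a hypothetical table name.
    csv.write()
            .bucketBy(100, "idvalue")
            .sortBy("timestamp")
            .option("compression", "snappy")
            .saveAsTable("events");

    // In job-2: read the table back; the bucket spec matches the window's
    // partitioning, so the shuffle can be elided (a sort may remain).
    Dataset<Row> ds1 = spark.table("events");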

0 Answers:

No answers yet