Question

例如，如果我有一个包含transaction number和transaction date [作为timestamp]列的表格，我如何找出hourly basis上的交易总数？

是否有任何Spark sql函数可用于此类范围计算？

Answer 1

您可以使用from_unixtime功能。

val sqlContext = new SQLContext(sc)

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
      .groupBy('hour)
      .agg(count('transaction_number) as 'transactions)

结果：

+----+------------+
|hour|transactions|
+----+------------+
|  10|        1000|
|  12|        2000|
|  13|        3000|
|  14|        4000|
|  ..|        ....|
+----+------------+

Answer 2

在这里，我试图给出一些指针，而不是完整的代码，请看这个

Time Interval Literals : Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like “Find all transactions that have happened during the past hour”. An interval literal is constructed using the following syntax: [sql]INTERVAL value unit[/sql]

以下是python的方式。您可以修改以下示例以匹配您的要求，即交易日期开始时间，相应的结束时间。在你的情况下，而不是id的交易号。

# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
  ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
  ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
  ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
  df.start_time.cast("timestamp").alias("start_time"),
  df.end_time.cast("timestamp").alias("end_time"),
  df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
  (to_date(df.start_time) == to_date(df.end_time)) & \
  (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+———————+———————+—+
|start_time           |            end_time |id |
+———————+———————+—+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2  |
+———————+———————+—+

使用此方法，您可以应用组功能查找您案例中的总交易数。

上面是python代码，scala怎么样？

上面使用的

expr function也可以在scala中使用

另请查看spark-scala-datediff-of-two-columns-by-hour-or-minute 这描述如下..

import org.apache.spark.sql.functions._
    val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
    val df2 = df1
      .withColumn( "diff_secs", diff_secs_col )
      .withColumn( "diff_mins", diff_secs_col / 60D )
      .withColumn( "diff_hrs",  diff_secs_col / 3600D )
      .withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

Spark SQL - 如何每小时查找总交易数

2 个答案: