My source CSV has a time column, for example:

Time                 Attempt
12.07.2018 00:00:00  50
12.07.2018 00:15:00  60
...
13.07.2018 00:00:00  100
13.07.2018 00:15:00  30

I want to group by dd/MM/yyyy HH24. In SQL we can use to_date('Time','dd/mm/yyyy hh24'), but when I tried the equivalent in Spark I got the errors shown below.
Please advise. Many thanks.
val dfAgg = df.select(
  unix_timestamp($"time", "yyyy/MM/dd HH:mm:ss").cast(TimestampType).as("timestamp"),
  unix_timestamp($"time", "yyyy/MM/dd HH").cast(TimestampType).as("time2"),
  to_date($"time", "HH").as("time3"),
  to_date($"time", "yyyy/MM/dd").as("time4")
)
<console>:94: error: too many arguments for method to_date: (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
,to_date($"time","HH").as("time3")
^
<console>:95: error: too many arguments for method to_date: (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
,to_date($"time","yyyy/MM/dd").as("time4")
Answer 0 (score: 0)
// what principal the master/region servers use
config.set("hbase.regionserver.kerberos.principal", "hbase/_HOST@FIELD.HORTONWORKS.COM");
config.set("hbase.regionserver.keytab.file", "src/hbase.service.keytab");
// this is needed even if you connect over rpc/zookeeper
config.set("hbase.master.kerberos.principal", "hbase/_HOST@FIELD.HORTONWORKS.COM");
config.set("hbase.master.keytab.file", "src/hbase.service.keytab");
Answer 1 (score: 0)
The to_timestamp function can be used to convert a string column to a timestamp:
// to_timestamp(col, fmt) is available from Spark 2.2.0 on; in spark-shell,
// spark.implicits._ and org.apache.spark.sql.functions._ are already imported.
val data = List(
  ("12.07.2018 00:00:00", 50),
  ("12.07.2018 00:15:00", 60),
  ("13.07.2018 00:00:00", 100),
  ("13.07.2018 00:15:00", 30))
val df = data.toDF("time", "value").select(
  to_timestamp($"time", "dd.MM.yyyy HH:mm:ss")
)
df.printSchema()
df.show(false)
Output:
root
|-- to_timestamp(`time`, 'dd.MM.yyyy HH:mm:ss'): timestamp (nullable = true)
+-------------------------------------------+
|to_timestamp(`time`, 'dd.MM.yyyy HH:mm:ss')|
+-------------------------------------------+
|2018-07-12 00:00:00 |
|2018-07-12 00:15:00 |
|2018-07-13 00:00:00 |
|2018-07-13 00:15:00 |
+-------------------------------------------+
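A possible follow-up (my own sketch, assuming Spark 2.3+, where date_trunc is available): once the string is parsed to a proper timestamp, you can truncate it to the start of its hour and aggregate, rather than re-parsing with a shorter pattern:

// Parse the string, truncate each timestamp to the start of its hour,
// then sum the values per hour window. date_trunc requires Spark 2.3+.
import org.apache.spark.sql.functions.{to_timestamp, date_trunc}

val hourly = data.toDF("time", "value")
  .withColumn("ts", to_timestamp($"time", "dd.MM.yyyy HH:mm:ss"))
  .groupBy(date_trunc("hour", $"ts").as("hour_window"))
  .sum("value")
hourly.show(false)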
Answer 2 (score: 0)
You are getting that error because your Spark version is below 2.2.0: Spark 2.2.0 introduced def to_date(e: Column, fmt: String). Check the API documentation.
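For versions before 2.2.0, a common workaround (a sketch only, relying on the pre-3.0 legacy parser) is unix_timestamp(col, fmt), which has accepted a format string since Spark 1.5, followed by a cast:

// Sketch for Spark < 2.2.0. The legacy SimpleDateFormat-based parser stops
// once the pattern is consumed, so "dd.MM.yyyy HH" ignores the trailing
// ":mm:ss" and effectively truncates each value to its hour.
val dfOld = df.select(
  unix_timestamp($"time", "dd.MM.yyyy HH:mm:ss").cast("timestamp").as("ts"),
  unix_timestamp($"time", "dd.MM.yyyy HH").cast("timestamp").as("hour_window")
)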
On 2.2.0 and later, you can use the to_timestamp() function to create the hourly window:
val df = data.toDF("time", "value")
  // parsing with an hour-only pattern truncates each timestamp to its hour
  .select('time, 'value, to_timestamp('time, "dd.MM.yyyy HH") as "hour_window")
  .groupBy('hour_window).sum("value").show
Returns:
+-------------------+----------+
| hour_window|sum(value)|
+-------------------+----------+
|2018-07-13 00:00:00| 130|
|2018-07-12 00:00:00| 110|
+-------------------+----------+
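If you want the grouping key rendered exactly as dd/MM/yyyy HH, like the to_date('Time','dd/mm/yyyy hh24') call in the question's SQL, date_format can turn the parsed timestamp back into such a string; a small sketch under the same assumptions as above:

// Group on a "dd/MM/yyyy HH" string key, mirroring the SQL in the question.
// date_format(col, fmt) has been available since Spark 1.5.
import org.apache.spark.sql.functions.{to_timestamp, date_format}

val byHourString = data.toDF("time", "value")
  .select(to_timestamp($"time", "dd.MM.yyyy HH:mm:ss").as("ts"), $"value")
  .groupBy(date_format($"ts", "dd/MM/yyyy HH").as("hour"))
  .sum("value")
byHourString.show(false)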