I am trying to calculate the time difference between 2 rows and applied the solution from this SO question. However, I get an exception:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling
> statement: FAILED: SemanticException Failed to breakup Windowing
> invocations into Groups. At least 1 group must only depend on input
> columns. Also check for circular dependencies. Underlying error:
> Expecting left window frame boundary for function
> LAG((tok_table_or_col time), 1, 0) Window
> Spec=[PartitioningSpec=[partitionColumns=[(tok_table_or_col
> client_id)]orderColumns=[(tok_table_or_col time) ASC
> NULLS_FIRST]]window(type=ROWS, start=1 PRECEDING, end=currentRow)] as
> LAG_window_0 to be unbounded. Found : 1
HiveQL:
SELECT id, loc, LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time ROWS 1 PRECEDING) - time AS response_time FROM mytable
How can I fix this? What is wrong with the query?
Edit:
Sample data:
id loc time
0 1 1414250523591
0 1 1414250523655
1 2 1414250523655
1 2 1414250523661
1 3 1414250523661
1 3 1414250523662
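What I want is the time difference between the rows that share the same id and loc (there are always exactly 2 such rows, i.e. a pair).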
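To make the desired result concrete, here is a minimal Python sketch (not Hive) that computes the difference for each (id, loc) pair from the sample data above. The function name `pair_differences` is just something I made up for illustration, and it assumes every group has at least one row.

```python
from collections import defaultdict

# Sample rows from the question: (id, loc, time in epoch milliseconds).
rows = [
    (0, 1, 1414250523591),
    (0, 1, 1414250523655),
    (1, 2, 1414250523655),
    (1, 2, 1414250523661),
    (1, 3, 1414250523661),
    (1, 3, 1414250523662),
]


def pair_differences(rows):
    """Return {(id, loc): latest time - earliest time} for each group."""
    groups = defaultdict(list)
    for id_, loc, time in rows:
        groups[(id_, loc)].append(time)
    return {key: max(times) - min(times) for key, times in groups.items()}


# Print one line per (id, loc) group: id, loc, difference in milliseconds.
for (id_, loc), diff in sorted(pair_differences(rows).items()):
    print(id_, loc, diff)
```

For the sample data this gives 64 for (0, 1), 6 for (1, 2) and 1 for (1, 3), which is the kind of response_time column I am hoping to get out of Hive.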
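Edit 2: I should also mention that I am new to the hadoop / hive ecosystem. Since the error says the window should be unbounded, I simply removed the ROWS clause. Now the query at least does something, but the result is still wrong. So I wanted to check what the LAG value actually is: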
SELECT id, loc, LAG(time, 1) OVER (PARTITION BY id, loc ORDER BY time) AS lag_col FROM mytable
This is the output I get:
id loc lag_col
1 2 null
1 2 -1
1 3 null
1 3 -1
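The null is clear, because I removed the default value, but why -1? Could the large values in the time column cause some kind of overflow? The column is defined as bigint, so that should not actually be a problem, but could it be getting cast to int at some point during the query?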
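As a quick sanity check on the overflow idea, here is a small Python sketch (again, not Hive) that compares the sample timestamps against the 32-bit INT and 64-bit BIGINT ranges and shows what a naive wrap to 32 bits would produce. The helper `to_int32` is hypothetical, and whether Hive performs any such narrowing here is purely my assumption.

```python
INT_MAX = 2**31 - 1      # largest value of a 32-bit signed INT
BIGINT_MAX = 2**63 - 1   # largest value of a 64-bit signed BIGINT

# Distinct time values from the sample data (epoch milliseconds).
sample_times = [1414250523591, 1414250523655, 1414250523661, 1414250523662]


def to_int32(value):
    """Keep the low 32 bits and reinterpret them as signed two's complement."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


for t in sample_times:
    # Columns: value, exceeds INT range?, fits in BIGINT?, value after a naive 32-bit wrap
    print(t, t > INT_MAX, t <= BIGINT_MAX, to_int32(t))
```

All of the values are too large for INT but fit comfortably in BIGINT, and a plain 32-bit wrap of these particular values does not yield -1, so a simple truncation alone would not seem to explain what I am seeing.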