How to group time-related events together in Hive SQL

Date: 2017-08-04 15:46:55

Tags: sql hadoop hive hiveql

I have a dataset that looks like this representative sample (it is the result set of this query):

time                          
2012-02-01 23:43:16.9088243 <--
2012-02-01 23:43:16.9093561
2012-02-01 23:43:16.9098879

2012-02-01 23:43:17.1018243 <--
2012-02-01 23:43:17.1023561
2012-02-01 23:43:17.1028879

2012-02-01 23:43:17.2018243 <--
2012-02-01 23:43:17.2023561
2012-02-01 23:43:17.2028879

The result contains millions of rows, so we need a way to refine it before we can analyze it.

If you look closely, the first three rows above are within a thousandth of a second of each other, but the next three rows are about a tenth of a second later, and the three after that are another tenth of a second later still. I have added the blank lines (not in the original data) to illustrate this.

I need a query that identifies each timestamp that is more than a thousandth of a second after the previous timestamp. The expected output (assuming the first group of three is also a tenth of a second after whatever preceded it) would be:

2012-02-01 23:43:16.9088243
2012-02-01 23:43:17.1018243
2012-02-01 23:43:17.2018243

I know I probably need some kind of Row_Number function and partitioning, but I can't quite wrap my head around it.

1 Answer:

Answer 0 (score: 1):

You can use lag():
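
A minimal sketch of that approach, assuming the result set above is exposed as a table or view named events with a single TIMESTAMP column named time (both names are placeholders for whatever your query actually produces):

SELECT `time`
FROM (
    SELECT
        `time`,
        -- Casting a Hive TIMESTAMP to DOUBLE yields seconds since the
        -- epoch including the fractional part, so sub-second gaps survive.
        CAST(`time` AS DOUBLE)
            - LAG(CAST(`time` AS DOUBLE)) OVER (ORDER BY `time`) AS gap
    FROM events
) t
-- Keep rows that come more than a thousandth of a second after the
-- previous row; the first row has a NULL gap and starts the first group.
WHERE gap > 0.001 OR gap IS NULL;

With the sample data above this returns the three arrowed timestamps, since the gaps inside each group are roughly 0.0005 s while the gaps between groups are roughly 0.1 s. No PARTITION BY clause is needed because the comparison runs over the whole ordered result set.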