Spark SQL timestamp filter comparison not working correctly

Date: 2018-06-07 08:09:04

Tags: apache-spark timestamp apache-spark-sql

I cannot disclose the actual table names due to client compliance.

We are using Spark 1.6.0. I have a table "test" with a timestamp column "cal_cymd" and a quantity column "qty". When I run a query using the "=" operator in the WHERE clause, it returns rows as expected:

hc.sql("SELECT SUM(A.qty) as qty, A.cal_cymd FROM test A WHERE A.cal_cymd = '2013-04-01 00:00:00.000000' group by A.cal_cymd ORDER BY A.cal_cymd ASC").show

Result:

+-----------------------+-------------------+
|qty                    |           cal_cymd|
+-----------------------+-------------------+
|                6245564|2013-04-01 00:00:00|
+-----------------------+-------------------+

Explain plan:

== Physical Plan ==
*Sort [cal_cymd#662 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(cal_cymd#662 ASC NULLS FIRST, 200)
   +- *HashAggregate(keys=[cal_cymd#662], functions=[sum(cast(qty#633 as bigint))])
      +- Exchange hashpartitioning(cal_cymd#662, 200)
         +- *HashAggregate(keys=[cal_cymd#662], functions=[partial_sum(cast(qty#633 as bigint))])
            +- *FileScan parquet test[qty#633,cal_cymd#662] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://<namenode>/tmp/test, PartitionCount: 1, PartitionFilters: [isnotnull(cal_cymd#662), (cal_cymd#662 = 1364799600000000)], PushedFilters: [], ReadSchema: struct<qty:int>
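
The pushed partition filter above compares cal_cymd against 1364799600000000, an epoch value. A quick decode I ran on the side (my own snippet, not output from the job) suggests that with "=" the string literal really is cast to a timestamp before comparison:

import java.sql.Timestamp

// 1364799600000000 from the plan looks like microseconds since the epoch:
// 1364799600 s is 2013-04-01 07:00:00 UTC, i.e. 2013-04-01 00:00:00 in a
// UTC-7 session timezone, matching the literal in my query.
val micros = 1364799600000000L
println(new Timestamp(micros / 1000)) // prints 2013-04-01 00:00:00.0 on a UTC-7 JVM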

But when I use ">=", the "=" part seems to be ignored and the lower-bound value is skipped, as shown below:

hc.sql("SELECT SUM(A.qty) as qty, A.cal_cymd FROM test A WHERE A.cal_cymd >= '2013-04-01 00:00:00.000000' AND A.cal_cymd <= '2013-04-10 00:00:00.000000' group by A.cal_cymd ORDER BY A.cal_cymd ASC").show

Result:

+-----------------------+-------------------+
|qty                    |           cal_cymd|
+-----------------------+-------------------+
|                6522988|2013-04-02 00:00:00|
|                5657898|2013-04-03 00:00:00|
|                5893992|2013-04-04 00:00:00|
|                5678169|2013-04-05 00:00:00|
|                7162790|2013-04-08 00:00:00|
|                6814059|2013-04-09 00:00:00|
|                6112823|2013-04-10 00:00:00|
+-----------------------+-------------------+

Explain plan:

== Physical Plan ==
*Sort [cal_cymd#565 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(cal_cymd#565 ASC NULLS FIRST, 200)
   +- *HashAggregate(keys=[cal_cymd#565], functions=[sum(cast(qty#536 as bigint))])
      +- Exchange hashpartitioning(cal_cymd#565, 200)
         +- *HashAggregate(keys=[cal_cymd#565], functions=[partial_sum(cast(qty#536 as bigint))])
            +- *FileScan parquet test[qty#536,cal_cymd#565] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://<namenode>/tmp/test, PartitionCount: 7, PartitionFilters: [isnotnull(cal_cymd#565), (cast(cal_cymd#565 as string) >= 2013-04-01 00:00:00.000000), (cast(cal..., PushedFilters: [], ReadSchema: struct<qty:int>

The result above skips the "2013-04-01 00:00:00.000000" timestamp value, even though we have data for it.
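
One thing I notice in the second plan is that cal_cymd is cast to string before the comparison, unlike the "=" case. If the partition value is rendered without fractional seconds, a plain lexicographic comparison would exclude exactly the boundary day. A quick check of that hypothesis (my own snippet; the variable names are mine):

val partitionValue = "2013-04-01 00:00:00"        // how the plan appears to render the value
val lowerBound     = "2013-04-01 00:00:00.000000" // the literal from my query
println(partitionValue >= lowerBound)             // false: the shorter string is a strict prefix
println(partitionValue >= "2013-04-01 00:00:00")  // true once the fractional part is dropped

If that is what happens, writing the bound as CAST('2013-04-01 00:00:00' AS TIMESTAMP), or dropping the fractional seconds from the literal, might avoid the string comparison, but I have not verified this.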

Please help me understand why timestamp comparison with ">=" is not working correctly.

Adding information to make this reproducible:

I have a Hive table in Parquet format, created with the following statement:

create external table test (cal_cymd timestamp, qty int) stored as parquet location '/tmp/test' tblproperties ('parquet.compress'='SNAPPY')

Sample data in the table:

+-----------------------+-------------------+
|qty                    |           cal_cymd|
+-----------------------+-------------------+
|                      3|2013-04-01 00:00:00|
|                      3|2013-04-02 00:00:00|
|                      3|2013-04-03 00:00:00|
|                      2|2013-04-04 00:00:00|
|                      3|2013-04-05 00:00:00|
|                      7|2013-04-06 00:00:00|
|                      8|2013-04-07 00:00:00|
|                      1|2013-04-08 00:00:00|
|                      1|2013-04-09 00:00:00|
|                      5|2013-04-10 00:00:00|
+-----------------------+-------------------+
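
For completeness, this is roughly how the sample rows can be loaded from the Spark shell (my own sketch, assuming sc exists and hc is a HiveContext; the real table may be laid out differently):

import java.sql.Timestamp

// Load the sample rows shown above and write them to the table location.
val sample = Seq(
  (3, "2013-04-01 00:00:00"), (3, "2013-04-02 00:00:00"),
  (3, "2013-04-03 00:00:00"), (2, "2013-04-04 00:00:00"),
  (3, "2013-04-05 00:00:00"), (7, "2013-04-06 00:00:00"),
  (8, "2013-04-07 00:00:00"), (1, "2013-04-08 00:00:00"),
  (1, "2013-04-09 00:00:00"), (5, "2013-04-10 00:00:00")
).map { case (q, ts) => (q, Timestamp.valueOf(ts)) }

hc.createDataFrame(sample).toDF("qty", "cal_cymd")
  .write.mode("overwrite")
  .partitionBy("cal_cymd") // the plans show partition pruning on cal_cymd
  .parquet("/tmp/test")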

0 Answers:

No answers yet.