I cannot disclose the actual table names due to client compliance.
We are using Spark 1.6.0. I have a table "test" with a timestamp column "cal_cymd" and a quantity column "qty". When I run a query with the "=" operator in the WHERE clause, it returns rows as expected:
hc.sql("SELECT SUM(A.qty) as qty, A.cal_cymd FROM test A WHERE A.cal_cymd = '2013-04-01 00:00:00.000000' group by A.cal_cymd ORDER BY A.cal_cymd ASC").show
Result:
+-----------------------+-------------------+
|qty | cal_cymd|
+-----------------------+-------------------+
| 6245564|2013-04-01 00:00:00|
+-----------------------+-------------------+
Explain plan:
== Physical Plan ==
*Sort [cal_cymd#662 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(cal_cymd#662 ASC NULLS FIRST, 200)
+- *HashAggregate(keys=[cal_cymd#662], functions=[sum(cast(qty#633 as bigint))])
+- Exchange hashpartitioning(cal_cymd#662, 200)
+- *HashAggregate(keys=[cal_cymd#662], functions=[partial_sum(cast(qty#633 as bigint))])
+- *FileScan parquet test[qty#633,cal_cymd#662] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://<namenode>/tmp/test, PartitionCount: 1, PartitionFilters: [isnotnull(cal_cymd#662), (cal_cymd#662 = 1364799600000000)], PushedFilters: [], ReadSchema: struct<qty:int>
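As a side note on reading this plan, the value `1364799600000000` in PartitionFilters appears to be the timestamp encoded as microseconds since the Unix epoch. Decoding it (my own check, not part of the original output) suggests the session timezone is UTC-7, since that is where it lands on midnight:

```python
from datetime import datetime, timezone, timedelta

# The "=" partition filter compares the timestamp as epoch microseconds.
micros = 1364799600000000
utc = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
print(utc)    # 2013-04-01 07:00:00+00:00

# In a UTC-7 session timezone this is exactly midnight on 2013-04-01:
local = utc.astimezone(timezone(timedelta(hours=-7)))
print(local)  # 2013-04-01 00:00:00-07:00
```

So in the "=" case the literal is converted to the timestamp's native numeric representation before comparing, which is why the match behaves correctly here.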
But when I use ">=" it seems to ignore the "=" part and skips the lower-bound value, as shown below:
hc.sql("SELECT SUM(A.qty) as qty, A.cal_cymd FROM test A WHERE A.cal_cymd >= '2013-04-01 00:00:00.000000' AND A.cal_cymd <= '2013-04-10 00:00:00.000000' group by A.cal_cymd ORDER BY A.cal_cymd ASC").show
Result:
+-----------------------+-------------------+
|qty | cal_cymd|
+-----------------------+-------------------+
| 6522988|2013-04-02 00:00:00|
| 5657898|2013-04-03 00:00:00|
| 5893992|2013-04-04 00:00:00|
| 5678169|2013-04-05 00:00:00|
| 7162790|2013-04-08 00:00:00|
| 6814059|2013-04-09 00:00:00|
| 6112823|2013-04-10 00:00:00|
+-----------------------+-------------------+
Explain plan:
== Physical Plan ==
*Sort [cal_cymd#565 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(cal_cymd#565 ASC NULLS FIRST, 200)
+- *HashAggregate(keys=[cal_cymd#565], functions=[sum(cast(qty#536 as bigint))])
+- Exchange hashpartitioning(cal_cymd#565, 200)
+- *HashAggregate(keys=[cal_cymd#565], functions=[partial_sum(cast(qty#536 as bigint))])
+- *FileScan parquet test[qty#536,cal_cymd#565] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://<namenode>/tmp/test, PartitionCount: 7, PartitionFilters: [isnotnull(cal_cymd#565), (cast(cal_cymd#565 as string) >= 2013-04-01 00:00:00.000000), (cast(cal..., PushedFilters: [], ReadSchema: struct<qty:int>
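One thing I notice: unlike the "=" plan, this plan applies the filter as `cast(cal_cymd as string) >= 2013-04-01 00:00:00.000000`, i.e. a string comparison. If the cast renders a whole-second timestamp without a fractional part (my assumption; the plan is truncated), plain lexicographic ordering would exclude exactly the lower bound, because the rendered value is a proper prefix of the literal:

```python
# Assumed string forms after cast(timestamp as string):
rendered = "2013-04-01 00:00:00"         # fractional seconds dropped by the cast
literal = "2013-04-01 00:00:00.000000"   # literal from the WHERE clause

# A proper prefix sorts before the longer string, so the lower bound
# fails the >= test ...
print(rendered >= literal)               # False

# ... while every later day still passes, matching the observed result:
print("2013-04-02 00:00:00" >= literal)  # True
```

I have not confirmed this is what Spark does internally; it is just the reading that fits the plan and the missing row.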
The result above skips the "2013-04-01 00:00:00.000000" timestamp value, even though we have data corresponding to it.
Please help me understand why the timestamp comparison with ">=" is not working correctly.
Additional information to make this reproducible:
I have a Hive table in Parquet format, created with the following statement:
create external table test ( cal_cymd timestamp, qty int) stored as parquet location '/tmp/test' tblproperties ('parquet.compress'='SNAPPY')
Sample data in the table:
+-----------------------+-------------------+
|qty | cal_cymd|
+-----------------------+-------------------+
| 3|2013-04-01 00:00:00|
| 3|2013-04-02 00:00:00|
| 3|2013-04-03 00:00:00|
| 2|2013-04-04 00:00:00|
| 3|2013-04-05 00:00:00|
| 7|2013-04-06 00:00:00|
| 8|2013-04-07 00:00:00|
| 1|2013-04-08 00:00:00|
| 1|2013-04-09 00:00:00|
| 5|2013-04-10 00:00:00|
+-----------------------+-------------------+