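I have a timestamp column stored as a string. The data is in the format "yyyy-mm-dd hh:mm:ss". I have two solutions for retrieving only the date part.
Performance-wise, which one is the better solution for a huge volume of data?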
Answer 0 (score: 0)
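I think the answer to your question depends on many things, but in general, looking at the explain plan is a good starting point. In my tests, the two plans showed no difference apart from the expression being evaluated.
Note: this was tested in a Cloudera environment on Hive version 1.1.0-cdh5.12.2.
Using TO_DATE():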
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: to_date(some_date) (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
Using SUBSTR():
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: substr(some_date, 1, 10) (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 163043612 Data size: 178714012511 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
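For reference, the two plans above could be reproduced with statements along these lines. This is only a sketch: the table name `my_table` is a placeholder (the original table is not named in the output), while `some_date` and the aliases `a` and `b` are taken from the plans themselves.

```sql
-- Hypothetical table name; some_date is the STRING column shown in the plans.

-- Plan 1: extract the date part with TO_DATE()
EXPLAIN
SELECT to_date(some_date)
FROM my_table a;

-- Plan 2: extract the first 10 characters ("yyyy-mm-dd") with SUBSTR()
EXPLAIN
SELECT substr(some_date, 1, 10)
FROM my_table b;
```

Both statements compile to a single map-only stage with one Select Operator, so the plan alone cannot separate them. Any remaining difference would come from the per-row cost of the function itself, which would have to be measured by running both queries on a representative sample of the data.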