How to unify timestamp formats in Hive

Time: 2016-11-01 03:49:31

Tags: regex apache hadoop hive

I have a Hive table with a timestamp column that contains several different formats. You can treat the following data as a sample of it.

Steven Li       1994-07-01      Master
Joe Wang        Apr 01, 2001    Phd
James Hou       12-01-99        Master
Al Zhang        10-05-1998      Phd

I want to recognize these four formats and convert them all to Unix timestamps. I use the following code:

select name,
    case
        when(regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
        when(regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
        when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
        when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
    end as ts_ext,
    education
from ts_raw_ext;

The output is:

Steven Li       NULL    Master
Joe Wang        986083200       Phd
James Hou       NULL    Master
Al Zhang        NULL    Phd

I tested all of the regular expressions on regex101 and they all seem fine, but the output is wrong. Can anyone tell me how to get this working? Thanks!

2 Answers:

Answer 0 (score: 0)

The query you provided is correct. Check the data type of the ts column:

hive> desc ts_raw_ext;
OK
name                    string                                      
ts                      string                                      
education               string                                      
Time taken: 0.477 seconds, Fetched: 3 row(s)
hive>

hive> select * from ts_raw_ext;
OK
Steven Li   1994-07-01  Master
Time taken: 0.18 seconds, Fetched: 1 row(s)
hive> 
    > select name,
    >     case
    >         when(regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
    >         when(regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
    >         when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
    >         when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
    >     end as ts_ext,
    >     education
    > from ts_raw_ext;
OK
Steven Li   773020800   Master
Time taken: 0.284 seconds, Fetched: 1 row(s)
hive>
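
One more thing worth checking alongside the data type: as far as I know, Hive's regexp_extract returns an empty string rather than NULL when the pattern does not match. If so, an `is not null` test is true for every non-NULL ts, and the first branch of the CASE is always taken. A minimal diagnostic sketch, assuming the same ts_raw_ext table (the m_iso* aliases are made up for illustration and are not from the original post):

```sql
-- Diagnostic sketch: show what regexp_extract returns for each row.
-- m_iso, m_iso_is_null and m_iso_is_empty are illustrative aliases only.
select ts,
    regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0)           as m_iso,
    (regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) is null) as m_iso_is_null,
    (regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) = "")    as m_iso_is_empty
from ts_raw_ext;
```

If m_iso_is_empty comes back true for the rows that are not in yyyy-MM-dd form, the `is not null` conditions cannot tell the formats apart.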

Answer 1 (score: 0)

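I solved the problem. I had made a mistake in the CASE statement: I forgot the optional expression after the case keyword. I changed my code to the following, and it works: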

select name,
    case ts
        when regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
        when regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
        when regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
        when regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
    end as ts_ext,
    education
from ts_raw_ext;

The output of this code is:

+------------+------------+------------+--+
|    name    |   ts_ext   | education  |
+------------+------------+------------+--+
| Steven Li  | 773020800  | Master     |
| Joe Wang   | 986112000  | Phd        |
| James Hou  | 944035200  | Master     |
| Al Zhang   | 907570800  | Phd        |
+------------+------------+------------+--+
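
The `case ts when regexp_extract(...)` form works because each branch compares ts with the substring the pattern extracted, so a branch is taken only when the match spans the whole value. The same intent can probably be stated more directly with a searched CASE and anchored patterns. This is a sketch, assuming every ts value holds exactly one date and nothing else, and using Hive's RLIKE operator in place of regexp_extract:

```sql
-- Sketch only: same table and formats as above.
-- ^ and $ anchor each pattern so it must match the entire ts value.
select name,
    case
        when ts rlike "^\\d{4}-\\d{2}-\\d{2}$" then unix_timestamp(ts, "yyyy-MM-dd")
        when ts rlike "^[a-zA-Z]{3} \\d{2}, \\d{4}$" then unix_timestamp(ts, "MMM dd, yyyy")
        when ts rlike "^\\d{2}-\\d{2}-\\d{2}$" then unix_timestamp(ts, "MM-dd-yy")
        when ts rlike "^\\d{2}-\\d{2}-\\d{4}$" then unix_timestamp(ts, "MM-dd-yyyy")
    end as ts_ext,
    education
from ts_raw_ext;
```

Rows that match none of the four patterns yield NULL because there is no else branch. The values unix_timestamp returns depend on the session time zone.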