I have a Hive table that contains a timestamp column in several formats. You can treat the following data as a sample of it.
Steven Li 1994-07-01 Master
Joe Wang Apr 01, 2001 Phd
James Hou 12-01-99 Master
Al Zhang 10-05-1998 Phd
I want to recognize these four formats and unify them into a Unix timestamp. I am using the following code:
select name,
case
when(regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
when(regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
end as ts_ext,
education
from ts_raw_ext;
The output is:
Steven Li NULL Master
Joe Wang 986083200 Phd
James Hou NULL Master
Al Zhang NULL Phd
I tested all of the regular expressions on regex101 and they seem fine, but the output is wrong. Can anyone tell me how to get this working? Thanks!
Answer 0 (score: 0)
The query you provided is correct. Check the data type of ts in your table:
hive> desc ts_raw_ext;
OK
name string
ts string
education string
Time taken: 0.477 seconds, Fetched: 3 row(s)
hive>
hive> select * from ts_raw_ext;
OK
Steven Li 1994-07-01 Master
Time taken: 0.18 seconds, Fetched: 1 row(s)
hive>
> select name,
> case
> when(regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
> when(regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
> when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
> when(regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) is not null) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
> end as ts_ext,
> education
> from ts_raw_ext;
OK
Steven Li 773020800 Master
Time taken: 0.284 seconds, Fetched: 1 row(s)
hive>
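Beyond the column type, it may also help to look at what each pattern actually extracts for every row. The query below is only a diagnostic sketch: the table and column names come from the question, while the aliases (ts_len, iso_match, mon_match, short_match, long_match) are made up for illustration. As far as I know, regexp_extract returns an empty string rather than NULL when nothing matches, so it is worth checking what the "is not null" tests really see. An unexpected ts_len can also point to stray whitespace in the stored values.

```sql
-- Diagnostic sketch: raw value, its length, and what each pattern extracts.
-- Aliases are hypothetical; table and column names are taken from the question.
select ts,
       length(ts) as ts_len,
       regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0)       as iso_match,
       regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) as mon_match,
       regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0)       as short_match,
       regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0)       as long_match
from ts_raw_ext;
```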
Answer 1 (score: 0)
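I solved the problem. I made a mistake in how I used the case statement: I forgot the optional expression after the case keyword. I changed my code to the following and it works: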
select name,
case ts
when regexp_extract(ts, "\\d{4}-\\d{2}-\\d{2}", 0) then UNIX_TIMESTAMP(ts, "yyyy-MM-dd")
when regexp_extract(ts, "[a-zA-Z]{3} \\d{2}, \\d{4}", 0) then UNIX_TIMESTAMP(ts, "MMM dd, yyyy")
when regexp_extract(ts, "\\d{2}-\\d{2}-\\d{2}", 0) then UNIX_TIMESTAMP(ts, "MM-dd-yy")
when regexp_extract(ts, "\\d{2}-\\d{2}-\\d{4}", 0) then UNIX_TIMESTAMP(ts, "MM-dd-yyyy")
end as ts_ext,
education
from ts_raw_ext;
The output of this code is:
+------------+------------+------------+--+
| name | ts_ext | education |
+------------+------------+------------+--+
| Steven Li | 773020800 | Master |
| Joe Wang | 986112000 | Phd |
| James Hou | 944035200 | Master |
| Al Zhang | 907570800 | Phd |
+------------+------------+------------+--+
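In effect, "case ts when regexp_extract(ts, pattern, 0)" compares ts with the substring the pattern matched, so a branch is taken only when the pattern covers the whole value. The same idea could probably be written as a searched case with anchored rlike tests. This is a sketch, not something I ran against the original table, and it assumes ts holds only the date text with no surrounding whitespace:

```sql
-- Sketch: searched CASE with anchored RLIKE tests instead of regexp_extract.
-- ^ and $ require the whole value to match, so "10-05-1998" does not fall
-- into the two-digit-year branch.
select name,
       case
         when ts rlike "^\\d{4}-\\d{2}-\\d{2}$"       then unix_timestamp(ts, "yyyy-MM-dd")
         when ts rlike "^[a-zA-Z]{3} \\d{2}, \\d{4}$" then unix_timestamp(ts, "MMM dd, yyyy")
         when ts rlike "^\\d{2}-\\d{2}-\\d{2}$"       then unix_timestamp(ts, "MM-dd-yy")
         when ts rlike "^\\d{2}-\\d{2}-\\d{4}$"       then unix_timestamp(ts, "MM-dd-yyyy")
       end as ts_ext,  -- NULL when none of the patterns match
       education
from ts_raw_ext;
```

With the anchors in place the order of the branches should no longer matter, because each value can satisfy at most one of the four patterns.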