How can I use a regular expression in Hive to parse Apache log timestamps?

Time: 2016-04-27 19:19:39

Tags: regex logging hive timestamp

The records in my log file look like this:

107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401

I use the following table definition to parse the log file:

  

CREATE TABLE logfile (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
STORED AS TEXTFILE;

What regular expression can I use to parse the time field [23/Aug/2005:00:03:14 -0400] so that it is split into its parts, such as day, minute, and second?

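1 answer:

Answer 0: (score: 1)

Description

This regular expression does the following:

  • Parses the log entry and finds the date and time
  • Captures the individual parts of the timestamp: day, month, year, hour, minute, second, and UTC offset

Regular expression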

\[(\d{2})/([a-zA-Z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-\d{4})]

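Note: depending on the language, you may have to escape the forward slashes by replacing / with \/. This varies from language to language.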
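As a quick check outside Hive, the pattern can be tried in Python, whose re module does not require the forward slashes to be escaped. This is only a sketch: the names TIMESTAMP_RE, LOG_LINE, and split_timestamp are hypothetical and not part of the original answer. Note that the pattern as written accepts only a negative UTC offset, because group 7 requires a literal '-'.

```python
import re

# The answer's pattern, unchanged. "/" needs no escaping in Python.
TIMESTAMP_RE = re.compile(
    r"\[(\d{2})/([a-zA-Z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-\d{4})]"
)

# Sample log line taken from the question.
LOG_LINE = (
    '107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] '
    '"GET /images/theimage.gif HTTP/1.0" 200 11401'
)


def split_timestamp(line):
    """Return (day, month, year, hour, minute, second, offset), or None if no match."""
    match = TIMESTAMP_RE.search(line)
    return match.groups() if match else None


print(split_timestamp(LOG_LINE))
# ('23', 'Aug', '2005', '00', '03', '14', '-0400')
```

If the logs can also contain positive offsets such as +0100, one possible adjustment is to change the last group to ([+-]\d{4}).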

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [a-zA-Z]{3}              any character of: 'a' to 'z', 'A' to 'Z'
                             (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    -                        '-'
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \7
----------------------------------------------------------------------
  ]                        ']'
----------------------------------------------------------------------

Sample text

107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401

Live demo

https://regex101.com/r/hF4fP8/1

Sample matches

[0][0] = [23/Aug/2005:00:03:14 -0400]
[0][1] = 23
[0][2] = Aug
[0][3] = 2005
[0][4] = 00
[0][5] = 03
[0][6] = 14
[0][7] = -0400
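
Once the seven groups are captured, they can be combined into a single timezone-aware value. The following is a minimal Python sketch, assuming English three-letter month abbreviations; MONTHS and to_datetime are hypothetical helper names introduced here for illustration, not something from the original answer.

```python
from datetime import datetime, timedelta, timezone

# Assumption: month names are the English three-letter abbreviations used by Apache.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def to_datetime(day, month, year, hour, minute, second, offset):
    """Build an aware datetime from the seven captured strings."""
    # The offset looks like "-0400": sign, two hour digits, two minute digits.
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return datetime(
        int(year), MONTHS[month], int(day),
        int(hour), int(minute), int(second),
        tzinfo=timezone(sign * delta),
    )


ts = to_datetime("23", "Aug", "2005", "00", "03", "14", "-0400")
print(ts.isoformat())                           # 2005-08-23T00:03:14-04:00
print(ts.astimezone(timezone.utc).isoformat())  # 2005-08-23T04:03:14+00:00
```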