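Need a correct regular expression to break down a URL

Time: 2016-05-11 20:10:31

Tags: regex hadoop hive

I would like to break this data down further: instead of a single URL field, I want more detail such as the department, category, product, and so on (where present).

This is from the Cloudera tutorial.

Text to decode: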

150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

The Cloudera tutorial command, executed in the Hive Query Editor app in Hue:

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
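
To see what the tutorial's `input.regex` produces for the sample line, here is a minimal Python sketch of the same pattern. It is an illustration, not part of the tutorial: the names `LOG_PATTERN`, `COLUMNS`, and `split_log_line` are made up here, and it assumes the whole line has to match the pattern.

```python
import re

# Python rendering of the tutorial's 'input.regex'. The Hive string literal
# doubles its backslashes; a Python raw string does not need to.
LOG_PATTERN = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" '
    r'(\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Column names copied from the CREATE EXTERNAL TABLE statement above.
COLUMNS = ["ip", "date", "method", "url", "http_version",
           "code1", "code2", "dash", "user_agent"]

SAMPLE = ('150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
          '"GET /department/fan%20shop/category/water%20sports/product/'
          'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"')


def split_log_line(line):
    """Return a dict of column name to captured text, or None if no match."""
    m = LOG_PATTERN.fullmatch(line)  # assumption: the full line must match
    return dict(zip(COLUMNS, m.groups())) if m else None


row = split_log_line(SAMPLE)
print(row["url"])
```

The `url` column holds the whole request path as one string, which is the field the question wants to split further.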

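1 answer:

Answer 0 (score: 0)

Description

Assuming the language supports lookaheads, I would use a regular expression that does the following:

  • Find the "GET string
  • Move past the GET string and parse the text that follows it.
  • Assuming the section names and their values are all / delimited, look for the category, product, and department key names and return the associated values
  • Allow the key names to appear in any order
  • Keep the regular expression modular so that other key/value sets can be added or removed as needed

Regular expression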
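
The modularity point in the list above can be sketched in Python by generating one lookahead per key name. This is only an illustration under stated assumptions: `build_pattern` is a hypothetical helper, the key names are assumed to be valid Python identifiers (they double as named-group names), and the request line in the usage lines is invented to show keys in a different order.

```python
import re


def build_pattern(keys):
    """Compile a pattern with one positive lookahead per key.

    Each lookahead captures the path segment that follows /<key>/ into a
    named group, so adding or removing a key only changes the list.
    """
    lookaheads = "".join(
        r'(?=[^"]*?/%s/(?P<%s>[^"/]*)[/\s])' % (re.escape(key), key)
        for key in keys
    )
    return re.compile(r'^.*?"GET\s+' + lookaheads)


pattern = build_pattern(["category", "product", "department"])

# Hypothetical request line with the keys in a different order.
request = '"GET /product/p1/department/d1/category/c1/view HTTP/1.1"'
found = pattern.match(request).groupdict()
print(found)
```

Because every lookahead starts from the same position, the order of the list passed to `build_pattern` does not need to match the order of the segments in the URL.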

^.*?"GET\s+(?=[^"]*?/category/([^"/]*)[/\s])(?=[^"]*?/product/([^"/]*)[/\s])(?=[^"]*?/department/([^"/]*)[/\s])

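Capture groups

  • Capture group 0 will contain the matched fragment of the source string (everything up to and including "GET and the whitespace after it)
  • Capture group 1 will have the category
  • Capture group 2 will have the product
  • Capture group 3 will have the department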

Sample matches

Given your sample text:

150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

[0] = 150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET 
[1] = water%20sports
[2] = Pelican%20Sunstream%20100%20Kayak
[3] = fan%20shop
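
The matches listed above can be reproduced with a short Python check. This is a sketch rather than Hive code: the pattern is copied verbatim from the answer, the user-agent part of the sample line is dropped for brevity, and decoding the `%20` escapes with `urllib.parse.unquote` is an extra step not present in the original answer.

```python
import re
from urllib.parse import unquote

# The answer's pattern, unchanged, split across lines for readability.
PATTERN = re.compile(
    r'^.*?"GET\s+'
    r'(?=[^"]*?/category/([^"/]*)[/\s])'
    r'(?=[^"]*?/product/([^"/]*)[/\s])'
    r'(?=[^"]*?/department/([^"/]*)[/\s])'
)

# Sample log line from the question, truncated after the byte count.
LINE = ('150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
        '"GET /department/fan%20shop/category/water%20sports/product/'
        'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932')

match = PATTERN.match(LINE)
raw = match.groups()                      # (category, product, department)
decoded = tuple(unquote(v) for v in raw)  # turn %20 back into spaces

for index, value in enumerate(raw, start=1):
    print("[%d] = %s" % (index, value))
```

Group 0 stops right after `"GET ` because lookaheads inspect the text ahead without consuming it.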

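Explanation

In general, the regular expression moves the cursor forward until it finds the "GET string; from there it uses positive lookaheads to collect the individual values. Because a lookahead does not consume any characters, each one scans from the same starting position, which is why the keys may appear in any order. The lookaheads can be duplicated to collect additional substrings as needed.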

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  "GET                     '"GET'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /category/               '/category/'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /product/                '/product/'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /department/             '/department/'
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead