Question

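I would like to break this data down further. Rather than the whole URL portion, I want more detail such as department, category, product, etc. (if any).

This is from the Cloudera tutorial.

Text to decode: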

150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

The Cloudera tutorial command, executed in the Hive Query Editor app in Hue:

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
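
For reference, here is a rough Python sketch of what the tutorial's input.regex pulls out of the sample line. Hive's RegexSerDe uses Java regular expressions, so Python's re module is only a stand-in here, and ACCESS_LOG, COLUMNS and row are names made up for the illustration. It shows that the whole request path lands in the single url column, which is the part the question wants to split further.

```python
import re

# The tutorial's 'input.regex' with the Hive string escaping removed
# (\\[ becomes \[, \\d becomes \d). Python's re stands in for Java's regex engine.
ACCESS_LOG = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Column names copied from the CREATE EXTERNAL TABLE statement above.
COLUMNS = ["ip", "date", "method", "url", "http_version",
           "code1", "code2", "dash", "user_agent"]

line = (
    '150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
    '"GET /department/fan%20shop/category/water%20sports/product/'
    'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)

# One capture group per column, in table order.
row = dict(zip(COLUMNS, ACCESS_LOG.match(line).groups()))
print(row["url"])
```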

Answer 1

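Description

Assuming the language supports lookaheads, I would use a regular expression that does the following:

Find the "GET string
Move past the GET string and parse the text that follows it.
Assuming the key names and values are all / delimited, look for the category, product, and department key names and return the associated values
Allow the key names to appear in any order
Keep the regular expression modular, so that other key/value sets can be added and removed as needed

Regular Expression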

^.*?"GET\s+(?=[^"]*?/category/([^"/]*)[/\s])(?=[^"]*?/product/([^"/]*)[/\s])(?=[^"]*?/department/([^"/]*)[/\s])

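Capture Groups

Capture group 0 will contain the leading fragment of the source string, up to and including "GET and the whitespace after it
Capture group 1 will have the category value
Capture group 2 will have the product value
Capture group 3 will have the department value

Sample Match

Given your sample text:

150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"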

[0] = 150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET 
[1] = water%20sports
[2] = Pelican%20Sunstream%20100%20Kayak
[3] = fan%20shop
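
The match above can be reproduced with a short Python sketch. Python is used here only because its re module supports the same lookahead syntax; the URL-decoding step with urllib.parse.unquote is an addition of mine and not part of the regex itself, and PATTERN is just an illustrative name.

```python
import re
from urllib.parse import unquote

# The regular expression from above, split across lines for readability.
PATTERN = re.compile(
    r'^.*?"GET\s+'
    r'(?=[^"]*?/category/([^"/]*)[/\s])'
    r'(?=[^"]*?/product/([^"/]*)[/\s])'
    r'(?=[^"]*?/department/([^"/]*)[/\s])'
)

line = (
    '150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
    '"GET /department/fan%20shop/category/water%20sports/product/'
    'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)

m = PATTERN.match(line)

# Groups come back in the order the lookaheads are written, not URL order.
raw_category, raw_product, raw_department = m.groups()

# Optional extra step: turn %20 back into spaces.
category = unquote(raw_category)      # 'water sports'
product = unquote(raw_product)        # 'Pelican Sunstream 100 Kayak'
department = unquote(raw_department)  # 'fan shop'
```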

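Explanation

In general, the regular expression moves the cursor forward until it finds the "GET string; from there it uses positive lookaheads to collect the various values. These lookaheads can be duplicated to collect additional substrings as needed.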

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  "GET                     '"GET'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /category/               '/category/'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /product/                '/product/'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    [^"]*?                   any character except: '"' (0 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    /department/             '/department/'
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      [^"/]*                   any character except: '"', '/' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
    [/\s]                    any character of: '/', whitespace (\n,
                             \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
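
Since the lookaheads are just repeated blocks, the expression can be generated from a list of key names. The sketch below is one possible way to do that in Python; build_pattern and extract are hypothetical helpers, not anything from the tutorial. It makes two changes to the regex above, both my own assumptions about what is wanted: each lookahead gets an empty alternative so a missing key ("if any" in the question) leaves its group unset instead of failing the match, and the value class is [^"/\s]* rather than [^"/]*, because the latter would also swallow " HTTP" when the value is the last path segment.

```python
import re
from urllib.parse import unquote

def build_pattern(keys, optional=False):
    """Build one lookahead per key name (hypothetical helper).

    With optional=True each lookahead has an empty alternative, so an
    absent key yields None for its group rather than no match at all.
    """
    alt = "|" if optional else ""
    lookaheads = "".join(
        # [^"/\s]* (not [^"/]*) stops the value at the space before HTTP/1.1.
        rf'(?=[^"]*?/{re.escape(key)}/(?P<{key}>[^"/\s]*)[/\s]{alt})'
        for key in keys
    )
    return re.compile(r'^.*?"GET\s+' + lookaheads)

def extract(line, keys=("department", "category", "product")):
    """Return a dict of URL-decoded values, or None if the line has no GET request."""
    m = build_pattern(keys, optional=True).match(line)
    if m is None:
        return None
    return {k: unquote(v) if v is not None else None
            for k, v in m.groupdict().items()}

# A request that stops at the category level: no product segment.
short_line = (
    '150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
    '"GET /department/fan%20shop/category/water%20sports HTTP/1.1" '
    '200 1932 "-" "Mozilla/5.0"'
)
result = extract(short_line)
```

Adding another key is then a matter of passing a longer tuple, for example extract(line, keys=("department", "category", "product", "brand")), where "brand" is only a made-up key name.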
