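I'd like to break this data down further: rather than keeping the URL as a single field, I want more detail, such as the department, category, and product, where present.
This is from the Cloudera tutorial.
Text to decode: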
150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
The Cloudera tutorial command, executed in the Hive Query Editor app in Hue:
CREATE EXTERNAL TABLE intermediate_access_logs (
ip STRING,
date STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
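To make it clearer what the table above hands back, here is a minimal Python sketch that applies the same `input.regex` to the sample line. It assumes the doubled backslashes in the DDL are only string-literal escaping (so `\\[` means `\[` to the regex engine) and that the whole line has to match, as RegexSerDe is understood to require. The names `SERDE_REGEX`, `COLUMNS`, and `parse` are made up for this illustration.

```python
import re

# Sample line from the question.
LINE = (
    '150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
    '"GET /department/fan%20shop/category/water%20sports/product/'
    'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)

# Unescaped form of the 'input.regex' value (assumption: '\\' in the DDL
# is string escaping for a single backslash).
SERDE_REGEX = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" '
    r'(\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Column names in the order they are declared in the CREATE TABLE statement.
COLUMNS = ["ip", "date", "method", "url", "http_version",
           "code1", "code2", "dash", "user_agent"]


def parse(line):
    """Return a column -> value dict, or None if the line does not match."""
    m = SERDE_REGEX.fullmatch(line)
    return dict(zip(COLUMNS, m.groups())) if m else None


record = parse(LINE)
print(record["url"])
```

The whole path, including the department, category, and product segments, ends up in the single `url` column, which is why a second pass over that field is needed.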
Answer 0 (score: 0)
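Provided the language supports lookaheads, I would use a regular expression that does the following:
- finds the first "GET string
- stays inside the GET string and parses its text
- assuming the text is / delimited, looks for the category, product, and department key names and returns the associated values
Regular expression: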
^.*?"GET\s+(?=[^"]*?/category/([^"/]*)[/\s])(?=[^"]*?/product/([^"/]*)[/\s])(?=[^"]*?/department/([^"/]*)[/\s])
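Capture groups:
- Group 1 gets the category value
- Group 2 gets the product value
- Group 3 gets the department value
Sample matches
Given your sample text: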
150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
[0] = 150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET
[1] = water%20sports
[2] = Pelican%20Sunstream%20100%20Kayak
[3] = fan%20shop
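The matches above can be reproduced with a short Python sketch. Python is used here only because its `re` module supports lookaheads; the question itself is about Hive, so treat this as a way to check the pattern rather than as the Hive solution. The `unquote` step is an addition for this example that turns `%20` back into spaces.

```python
import re
from urllib.parse import unquote

# Sample line from the question.
LINE = (
    '150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] '
    '"GET /department/fan%20shop/category/water%20sports/product/'
    'Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)

# The regex from the answer, split across lines for readability.
PATTERN = re.compile(
    r'^.*?"GET\s+'
    r'(?=[^"]*?/category/([^"/]*)[/\s])'     # group 1: category
    r'(?=[^"]*?/product/([^"/]*)[/\s])'      # group 2: product
    r'(?=[^"]*?/department/([^"/]*)[/\s])'   # group 3: department
)

match = PATTERN.match(LINE)
raw_groups = match.groups()

# Percent-decode each captured value (extra step, not part of the regex).
category, product, department = (unquote(value) for value in raw_groups)

print(raw_groups)
print(category, product, department, sep=" | ")
```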
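In general, the regex advances the cursor until it finds the "GET string; from there it uses positive lookaheads to collect the various values. These lookaheads can be duplicated to collect additional substrings as needed.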
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
"GET '"GET'
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[^"]*? any character except: '"' (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
/category/ '/category/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"/]* any character except: '"', '/' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
[/\s] any character of: '/', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[^"]*? any character except: '"' (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
/product/ '/product/'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[^"/]* any character except: '"', '/' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
[/\s] any character of: '/', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[^"]*? any character except: '"' (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
/department/ '/department/'
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^"/]* any character except: '"', '/' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
[/\s] any character of: '/', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead