How can I use Pig to load logs with complex web log syntax?

Asked: 2013-07-03 21:58:22

Tags: hadoop mapreduce apache-pig

I'm a beginner with Pig. I have cdh4 Pig installed and I connect to a cdh4 cluster. We need to process some very large web log files (they are already loaded into HDFS). Unfortunately, the log syntax is quite complicated (not a typical comma-delimited file). A constraint is that I currently can't pre-process the log files with another tool, because they are so large that we can't afford to store a copy. Here is a raw line from the log:


"2013-07-02 16:17:12 -0700","c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a4034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-d053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329 [FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/6.1.3;FBSS/2; FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn, nn.nn.0.20"

As you may have noticed, there is some json embedded in there, but it is url-encoded. Here is what the json looks like after url decoding (can Pig do the decoding?):


{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a4034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-d053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}

I need to extract the various fields from the json, including the "things" field, which is actually a collection. I also need to extract the other query-string values in the log. Can Pig handle this kind of source data directly, and if so, could you kindly show me how to get Pig to parse and load it?

Thanks!

2 answers:

Answer 0: (score: 1)

For a task this complex you usually have to write your own Load function. I recommend Chapter 11, Writing Load and Store Functions, in Programming Pig; the Load/Store Functions page in the official documentation is too basic.
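
As a rough, hypothetical sketch of how such a loader plugs into a script (not from this answer: the jar path, class name, and field schema below are placeholders, and the loader itself would be a Java class extending org.apache.pig.LoadFunc that splits each raw line into fields):

-- hypothetical jar and loader class, shown only to illustrate the wiring
REGISTER /path/to/weblog-loader.jar;
logs = LOAD '/var/log/live/collector_2013-07-02-0145.log'
       USING com.example.pig.WebLogLoader()
       AS (ts:chararray, eventData:chararray, url:chararray, userAgent:chararray, clientIp:chararray, forwardedIps:chararray);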

Answer 1: (score: 1)

I experimented a lot and learned a great deal. I tried several json libraries, piggybank, and java.net.URLDecoder, and even tried CSVExcelStorage. I registered the libraries and was able to partially solve the problem. But when I ran the test against a larger data set, it started hitting encoding issues in some lines of the source data, which caused exceptions and job failures. So I ended up using Pig's built-in regex functionality to extract the desired values:

A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\\\"','"'); 
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');

-- extract each of the fields from the embedded json
A = foreach A GENERATE 
    REGEX_EXTRACT($0,'^.*"redirectId":"([^"\\}]+).*$',1) as redirectId, 
    REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\\}]+).*$',1) as fromUserId, 
    REGEX_EXTRACT($0,'^.*"userId":"([^"\\}]+).*$',1) as userId, 
    REGEX_EXTRACT($0,'^.*"listId":"([^"\\}]+).*$',1) as listId, 
    REGEX_EXTRACT($0,'^.*"c":"([^"\\}]+).*$',1) as eventType,
    REGEX_EXTRACT($0,'^.*"renderSource":"([^"\\}]+).*$',1) as renderSource,
    REGEX_EXTRACT($0,'^.*"renderType":"([^"\\}]+).*$',1) as renderType,
    REGEX_EXTRACT($0,'^.*"engageType":"([^"\\}]+).*$',1) as engageType,
    REGEX_EXTRACT($0,'^.*"clientTime":"([^"\\}]+).*$',1) as clientTime,
    REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\\}]+).*$',1) as clientTimeZone;

I decided not to use REGEX_EXTRACT_ALL in case the ordering of the fields varies.
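
For comparison, a REGEX_EXTRACT_ALL version has to capture every field with a single pattern, which only works when the keys appear in a fixed order. A deliberately simplified sketch, meant to replace the final FOREACH above rather than extend the answer's code:

-- one pattern that assumes renderType appears before renderSource, which appears before redirectId;
-- if a line orders the keys differently the whole match fails and every field comes back null
A = FOREACH A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL($0,
            '^.*"renderType":"([^"]+)".*"renderSource":"([^"]+)".*"redirectId":"([^"]+)".*$'))
    AS (renderType:chararray, renderSource:chararray, redirectId:chararray);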