
I'd like to use Athena to get a map of the query parameters from S3 access logs.

For example, given the following log line:

283e.. foo [17/Jun/2017:23:00:49 +0000] 76.117.221.205 - 1D0.. REST.GET.OBJECT 1x1.gif "GET /foo.bar/1x1.gif?placement_tag_id=0&r=574&placement_hash=12345... HTTP/1.1" 200 ... "Mozilla/5.0"

I'd like to get a map of [k, v] pairs called queryParams:

placement_tag_id, 0
r, 574
placement_hash, 12345

so that I can run queries such as:

select * from accessLogs where queryParams.placement_tag_id=0 and X.r>=500

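The number and content of the query parameters vary from request to request, so I can't use a static regex pattern.

I'm using serde2.RegexSerDe in the Athena CREATE TABLE query below to do a basic split of the log line, but I haven't found a way to get what I want. I considered MultiDelimitSerDe, but Athena doesn't support it.

Any suggestions on how to achieve this?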

CREATE EXTERNAL TABLE IF NOT EXISTS elb_db.accessLogs (
  timestamp string,
  request string,
  http_status string,
  user_agent string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '[^ ]* [^ ]* \\[(.*)\\] [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* "(.*?)" ([^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* ".*?" "(.*?)" [^ ]*'
)
LOCATION 's3://output/bucket'
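One direction that might avoid a static regex is to keep the table as it is and build the map at query time with Presto's string functions. The following is an untested sketch, not a confirmed solution. It assumes the `request` column holds the full request line (method, path with query string, protocol). The names `parsed`, `mapped` and `query_string` are made up for illustration. `split_to_map` raises an error when a key is repeated or when an entry does not contain exactly one `=`, so real logs may need extra cleanup first.

```sql
-- Sketch only: assumes the elb_db.accessLogs table defined above.
WITH parsed AS (
  SELECT
    *,
    -- Capture everything between '?' and the next space in the request line;
    -- NULLIF turns an empty query string into NULL.
    NULLIF(regexp_extract(request, '\?([^ ]*)', 1), '') AS query_string
  FROM elb_db.accessLogs
),
mapped AS (
  SELECT
    *,
    -- 'a=1&b=2' becomes a map(varchar, varchar); NULL input yields NULL.
    split_to_map(query_string, '&', '=') AS queryParams
  FROM parsed
)
SELECT *
FROM mapped
-- Map values are strings, so compare as text or cast explicitly.
WHERE element_at(queryParams, 'placement_tag_id') = '0'
  AND TRY_CAST(element_at(queryParams, 'r') AS integer) >= 500
```

The values in the map are not URL-decoded. If that matters, `url_decode` could be applied to individual values after the lookup.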

Getting query parameters from S3 access logs using Athena

0 answers: