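Extracting strings from logs with regular expressions in a Pig script

Time: 2017-11-08 15:56:47

Tags: regex hadoop apache-pig

I have log data, and I want to extract each piece of information into its own variable.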

Here is one sample log line:

{:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, :memory=>{:quota=>8192, :assigned=>8192, :actual=>8578}, :cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}

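I need one variable holding all the IDs, one holding all the names, one holding the CPU values, and one holding all the cluster statistics.

Below is part of my Pig script. I can store the ID, but I don't know how to extract the rest with regular expressions.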

...

matching_messages = FILTER raw_lines BY (LOWER(message) MATCHES '.*cc_altus-plaform.*');

ids = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'id=>\\d*',0);

names = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"\\",',0);

line_with_date = FOREACH matching_messages GENERATE
DateFormatter(timestamp) AS formatted_time: chararray, message;

DUMP names;

1 answer:

Answer 0: (score: 0)

The following snippet shows the regular expressions I wrote:

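-- Digits that follow "id=>"; the lookbehind keeps the key out of the match.
id = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'(?<=id=>)\\d*',0);

-- The quoted name with its key, e.g. name=>"bblite".
name = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"[\\w]*\\"',0);

-- The cpu hash up to its first closing brace, with the commas removed.
cpu = FOREACH matching_messages GENERATE REPLACE( REGEX_EXTRACT(message, 'cpu=>\\{.*?\\}',0), ',','');

-- The memory hash up to its first closing brace.
memory = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'memory=>\\{.*?\\}',0);

-- The cluster_stats hash up to the first closing brace, so a nested hash is cut short.
cluster = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'cluster_stats=>\\{.*?\\}',0);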
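
A possible refinement of the expressions above, offered as an untested sketch rather than a definitive script. Passing index 1 to REGEX_EXTRACT returns only the first capture group, so the key is dropped without needing a lookbehind. For cluster_stats, a pattern that allows one level of nested braces keeps the inner hash intact. The sketch assumes that matching_messages is the relation from the question with a chararray field named message, that the spacing around => may vary (hence the \\s*), and that hashes nest at most one level deep. The aliases after AS are illustrative names only.

```pig
-- Sketch only: matching_messages and message come from the question's script.
extracted = FOREACH matching_messages GENERATE
    -- Index 1 returns just the captured digits, not the ":id=>" key.
    REGEX_EXTRACT(message, ':id\\s*=>\\s*(\\d+)', 1) AS id,
    -- Everything between the double quotes, so names with hyphens also match.
    REGEX_EXTRACT(message, ':name\\s*=>\\s*"([^"]*)"', 1) AS name,
    -- Flat hashes: an opening brace, no further braces inside, a closing brace.
    REGEX_EXTRACT(message, ':cpu\\s*=>\\s*(\\{[^{}]*\\})', 1) AS cpu,
    REGEX_EXTRACT(message, ':memory\\s*=>\\s*(\\{[^{}]*\\})', 1) AS memory,
    -- One level of nesting: the outer braces may contain complete inner {...} groups.
    REGEX_EXTRACT(message, ':cluster_stats\\s*=>\\s*(\\{[^{}]*(?:\\{[^{}]*\\}[^{}]*)*\\})', 1) AS cluster_stats;

DUMP extracted;
```

Putting all five fields in one FOREACH keeps them together in a single tuple per log line, which makes them easier to store or join later than five separate relations would be. If a line lacks a field, REGEX_EXTRACT yields null for that column.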