Parsing a log file with regular expressions in Pig

Asked: 2015-04-12 12:03:21

Tags: regex hadoop apache-pig bigdata

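I need to parse the log file below. The script should treat everything from the start of one timestamp (for example 150324-21:06:32:937378) up to the start of the next timestamp as a single record. I tried using the library

org.apache.pig.piggybank.storage.MyRegExLoader

to load the records in a custom format.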

150324-21:06:32:937378 [mod=STB, lvl=INFO ]
    top - 21:06:33 up  3:41,  0 users,  load average: 0.75, 0.95, 0.72
    Tasks: 120 total,   3 running, 117 sleeping,   0 stopped,   0 zombie
    Cpu(s): 21.8%us, 12.9%sy,  2.9%ni, 60.7%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
    Mem:    317108k total,   232588k used,    84520k free,    25960k buffers
    Swap:        0k total,        0k used,        0k free,   110820k cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
    19122 root      20   0  456m  72m  37m R   72 23.5  85:50.22 Receiver           
     5859 root      20   0  349m 9128 6948 S   15  2.9  22:42.88 rmfStreamer
     150324-21:06:32:937378 [mod=STB, lvl=INFO ]
    top - 21:06:33 up  3:41,  0 users,  load average: 0.75, 0.95, 0.72
    Tasks: 120 total,   3 running, 117 sleeping,   0 stopped,   0 zombie
    Cpu(s): 21.8%us, 12.9%sy,  2.9%ni, 60.7%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
    Mem:    317108k total,   232588k used,    84520k free,    25960k buffers
    Swap:        0k total,        0k used,        0k free,   110820k cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
    19122 root      20   0  456m  72m  37m R   72 23.5  85:50.22 Receiver           
     5859 root      20   0  349m 9128 6948 S   15  2.9  22:42.88 rmfStreamer

This is the relevant snippet I am using:

raw_logs = LOAD './main*/*top_log*'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('(?m)(?s)\\d*-\\d{2}:\\d{2}:\\d{2}\\:\\d*.*')
    AS line:chararray;
DUMP raw_logs;

This is my output:

(150325-05:47:26:253050 [mod=STB, lvl=INFO ])
(150325-05:57:27:294069 [mod=STB, lvl=INFO ])
(150325-06:07:28:235302 [mod=STB, lvl=INFO ])
(150325-06:17:29:124282 [mod=STB, lvl=INFO ])
(150325-06:27:30:036264 [mod=STB, lvl=INFO ])
(150325-06:37:30:941804 [mod=STB, lvl=INFO ])
(150325-06:47:31:909712 [mod=STB, lvl=INFO ])

It should instead be 2 tuples like this:

(150324-21:06:32:937378 [mod=STB, lvl=INFO ]
top - 21:06:33 up  3:41,  0 users,  load average: 0.75, 0.95, 0.72
Tasks: 120 total,   3 running, 117 sleeping,   0 stopped,   0 zombie
Cpu(s): 21.8%us, 12.9%sy,  2.9%ni, 60.7%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
Mem:    317108k total,   232588k used,    84520k free,    25960k buffers
Swap:        0k total,        0k used,        0k free,   110820k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
19122 root      20   0  456m  72m  37m R   72 23.5  85:50.22 Receiver           
 5859 root      20   0  349m 9128 6948 S   15  2.9  22:42.88 rmfStreamer)
(150324-21:06:32:937378 [mod=STB, lvl=INFO ]
top - 21:06:33 up  3:41,  0 users,  load average: 0.75, 0.95, 0.72
Tasks: 120 total,   3 running, 117 sleeping,   0 stopped,   0 zombie
Cpu(s): 21.8%us, 12.9%sy,  2.9%ni, 60.7%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
Mem:    317108k total,   232588k used,    84520k free,    25960k buffers
Swap:        0k total,        0k used,        0k free,   110820k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
19122 root      20   0  456m  72m  37m R   72 23.5  85:50.22 Receiver           
 5859 root      20   0  349m 9128 6948 S   15  2.9  22:42.88 rmfStreamer) 

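Please let me know which regex I can use so that my script treats everything from the start of one timestamp up to the start of the next timestamp as one record.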

2 Answers:

Answer 0 (score: 0)

Try the capture group of the following regex:

([0-9]{6}-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]+ \[mod=[\s\S]*)[0-9]{6}-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]+ \[mod=
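Note that the greedy `[\s\S]*` above runs up to the last timestamp in the input, so several records end up in one group, and the final record has no following timestamp to close it. A possible variant is a lazy quantifier with a lookahead that stops at the next header or at the end of the input. The plain-Java sketch below illustrates that idea; the class and method names are hypothetical, and it assumes the whole file content is available as a single string. As far as I know, Pig's regex loaders apply the pattern one line at a time, so this shows the regex technique rather than a drop-in loader argument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or piggybank.
final class TimestampRecordRegex {
    // Timestamp header, e.g. "150324-21:06:32:937378 [mod="
    private static final String HEADER =
        "[0-9]{6}-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]+ \\[mod=";

    // Group 1: a header plus everything after it, matched lazily, up to
    // (but not including) the next header or the end of the input.
    private static final Pattern RECORD =
        Pattern.compile("(" + HEADER + "[\\s\\S]*?)(?=\\s*" + HEADER + "|\\z)");

    // Assumes the complete log text is passed in as one string.
    static List<String> records(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            // Trim the trailing newline left at the end of the last record.
            out.add(m.group(1).trim());
        }
        return out;
    }
}
```

Calling `TimestampRecordRegex.records(text)` on the sample log from the question should return one string per timestamp block, with the last block closed by the end of the input rather than by another header.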

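Answer 1 (score: 0)

I don't think this is possible with Pig alone. You will need a custom record reader that uses a regex to split the file at the timestamp that begins each record.

I hope the following link helps you write one: https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/

You may need to tweak some of its logic to get the timestamp on each line: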

if (m.matches()) {
    // Record delimiter
    delimiterString = tmp;
    break;
} else {
    // Append value to record
    text.append(EOL.getBytes(), 0, EOL.getLength());
    text.append(tmp.getBytes(), 0, tmp.getLength());
    text.append(delimiterString.getBytes(), 0, delimiterString.getLength());
}

The result will look like this:

top - 02:10:39 up 0 min, 0 users, load average: 2.26, 0.54, 0.18150323-02:10:37:619962 [mod=STB, lvl=INFO ]
Tasks: 133 total, 6 running, 127 sleeping, 0 stopped, 0 zombie150323-02:10:37:619962 [mod=STB, lvl=INFO ]
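
If you would rather get one record per timestamp block, as the question asks, the core of such a reader is a loop that opens a new record whenever a line matches the timestamp pattern and otherwise appends the line to the record in progress. The following is a minimal sketch of that loop in plain Java, outside Hadoop. The class name, method name, and header pattern are assumptions made for illustration, and a real RecordReader would also have to deal with input splits, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper showing only the grouping logic, not a Hadoop RecordReader.
final class TimestampRecordGrouper {
    // Assumed delimiter: a line that starts, after optional indentation,
    // with a timestamp such as 150324-21:06:32:937378.
    private static final Pattern HEADER =
        Pattern.compile("\\s*\\d{6}-\\d{2}:\\d{2}:\\d{2}:\\d+.*");

    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (HEADER.matcher(line).matches()) {
                // Delimiter line: close the previous record and open a new one.
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line.trim());
            } else if (current != null) {
                // Body line: append it unchanged to the open record.
                current.append('\n').append(line);
            }
            // Lines before the first header are skipped.
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Each returned string holds a header line followed by its body lines, which corresponds to one tuple in the desired output.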