Java: extracting a field and splitting off a substring with a regular expression

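Date: 2013-10-14 22:17:03

Tags: java regex hadoop syslog flume

How can I use a regular expression to extract the program name from a syslog message? I have a Java stream-processing module that accepts a regular expression for processing syslog messages.

A log line can look like any of these: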

2013-10-14T22:05:29+00:00 hostname sshd[6359]: Connection closed by 192.168.1.10
2013-10-14T22:05:29+00:00 hostname sshd:3322 Connection closed by 192.168.1.10
2013-10-14T22:05:29+00:00 hostname sshd/6359 Connection closed by 192.168.1.10
2013-10-14T22:05:29+00:00 hostname sshd Connection closed by 192.168.1.10
2013-10-14T22:05:29+00:00 hostname SSHD[1133] Connection closed by 192.168.1.10
2013-10-14T22:05:29+00:00 hostname SSH.D[6359]: Connection closed by 192.168.1.10

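The extraction should work like this: take the third space-separated substring, then extract everything up to the first [, :, / or space.

So for the first four log samples the extracted string is sshd, for the fifth it is SSHD, and for the sixth it is SSH.D. Is this correct?

Edit:

What I tried is ((?:[A-Za-z][A-Za-z0-9_.-]+)). It seems to work, but to be honest I adapted a sample regex and tweaked it with an online tool until it fit my use case, and I am not sure how it works.

4 answers:

Answer 0 (score: 1)

A double split should do the job: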

String token = data.split(" +")[2].split("[\\[:/]")[0];
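A minimal sketch of how the double split behaves on the six sample lines from the question. The class name `DoubleSplitDemo` and the helper `programName` are made up for illustration, and the code assumes every line has at least three space-separated fields (a shorter line would throw `ArrayIndexOutOfBoundsException`).

```java
public class DoubleSplitDemo {
    // Hypothetical helper: take the third space-separated field,
    // then keep only the part before the first '[', ':' or '/'.
    public static String programName(String line) {
        String field = line.split(" +")[2];
        return field.split("[\\[:/]")[0];
    }

    public static void main(String[] args) {
        String prefix = "2013-10-14T22:05:29+00:00 hostname ";
        String suffix = " Connection closed by 192.168.1.10";
        String[] tags = {
            "sshd[6359]:", "sshd:3322", "sshd/6359", "sshd", "SSHD[1133]", "SSH.D[6359]:"
        };
        for (String tag : tags) {
            // Rebuild each sample line from the question and print the extracted name.
            System.out.println(programName(prefix + tag + suffix));
        }
    }
}
```

The first split isolates the tag field, so the colons inside the timestamp never reach the second split.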

Answer 1 (score: 0)

Try something like this:

String str = line.split(" ")[2].replaceAll("(.+)(\\[|\\:|\\/).+", "$1");

I haven't tested it.
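Since the snippet above is untested, here is a small sketch that compares it with a simpler variant. The replacement pattern in `cutAtFirstDelimiter` is a swapped-in alternative, not the answerer's code, and the class and method names are hypothetical. The difference only shows up when a field contains more than one delimiter followed by further text, because the greedy `(.+)` backtracks from the right.

```java
public class FirstDelimiterCut {
    // The expression from the answer above, applied to an already isolated field.
    public static String greedyReplace(String field) {
        return field.replaceAll("(.+)(\\[|\\:|\\/).+", "$1");
    }

    // Alternative (an assumption, not from the answer): delete everything
    // from the first '[', ':' or '/' to the end of the field.
    public static String cutAtFirstDelimiter(String field) {
        return field.replaceAll("[\\[:/].*", "");
    }

    public static void main(String[] args) {
        // Fields taken from the question's samples, plus one made-up field with two colons.
        String[] fields = { "sshd[6359]:", "sshd:3322", "sshd/6359", "sshd", "a:b:c" };
        for (String field : fields) {
            System.out.println(field + " -> " + greedyReplace(field)
                    + " | " + cutAtFirstDelimiter(field));
        }
    }
}
```

On the sample fields from the question both methods agree; on the made-up field `a:b:c` the greedy version keeps `a:b` while the alternative keeps `a`.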

Answer 2 (score: 0)

I think the regular expression you are looking for is:

String regex = "([^\\[:/]+).*";

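.* means "match zero or more of any character". Putting a pair of parentheses in front of the dot-star, as in ().*, creates a group that can be retrieved from the matcher. Because it is the first pair of parentheses, it is referenced as group number 1. Inside the parentheses is a negated character class followed by a plus, [^...]+, which matches one or more characters that are not in the set. The set holds the delimiters named by the OP: "[", ":" and "/".

Here is a sample application that tests the result: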

package com.stackexchange.stackoverflow;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Question19370191 {
    public static void main(String[] args) {
        String regex = "([^\\[:/]+).*";
        Pattern pattern = Pattern.compile(regex);

        List<String> lines = new ArrayList<>();
        lines.add("2013-10-14T22:05:29+00:00 hostname sshd[6359]: Connection closed by 192.168.1.10");
        lines.add("2013-10-14T22:05:29+00:00 hostname sshd:3322 Connection closed by 192.168.1.10");
        lines.add("2013-10-14T22:05:29+00:00 hostname sshd/6359 Connection closed by 192.168.1.10");
        lines.add("2013-10-14T22:05:29+00:00 hostname sshd Connection closed by 192.168.1.10");
        lines.add("2013-10-14T22:05:29+00:00 hostname SSHD[1133] Connection closed by 192.168.1.10");
        lines.add("2013-10-14T22:05:29+00:00 hostname SSH.D[6359]: Connection closed by 192.168.1.10");

        for(String line : lines) {
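            // Third whitespace-separated field, e.g. "sshd[6359]:"
            String field = line.split("\\s")[2];
            String extraction = "";
            Matcher matcher = pattern.matcher(field);
            // matches() must consume the whole field; group 1 is the part before the first '[', ':' or '/'
            if(matcher.matches()) {
                extraction = matcher.group(1);
            }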

            System.out.println(String.format("Field \"%-12s\" Extraction \"%s\"", field, extraction));
        }
    }
}

It prints the following:

Field "sshd[6359]: " Extraction "sshd"
Field "sshd:3322   " Extraction "sshd"
Field "sshd/6359   " Extraction "sshd"
Field "sshd        " Extraction "sshd"
Field "SSHD[1133]  " Extraction "SSHD"
Field "SSH.D[6359]:" Extraction "SSH.D"

Answer 3 (score: 0)

If your sample data looks exactly like what you posted:

(?:.+?\s){2}([\w\.]+).+$

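Explanation:

(?:.+?\s){2} ... matches up to and including the second whitespace character

([^\s[:/]+) ... matches anything that is not whitespace, '[', ':' or '/' (the pattern above uses the narrower class [\w\.]+ in this position)

.+$ ... matches the rest of the line, up to EOL

The part you want ends up in capture group 1 (\1).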
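A sketch of how this whole-line pattern could be used from Java, assuming the negated-class form of the capture group from the explanation. Java treats an unescaped `[` inside a character class as the start of a nested class, so the bracket is escaped here. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeLineRegexDemo {
    // Skip two whitespace-terminated fields, capture the program name,
    // then consume the rest of the line. Backslashes are doubled for the Java string.
    private static final Pattern PROGRAM =
            Pattern.compile("(?:.+?\\s){2}([^\\s\\[:/]+).+$");

    // Returns the captured program name, or an empty string if the line does not match.
    public static String programName(String line) {
        Matcher matcher = PROGRAM.matcher(line);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        String line = "2013-10-14T22:05:29+00:00 hostname SSH.D[6359]: Connection closed by 192.168.1.10";
        System.out.println(programName(line));
    }
}
```

Unlike the split-based answers, this works on the full log line in one pass, which may suit a module that accepts only a single regular expression.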