How to filter IPs from a log file into another RDD

Time: 2016-02-06 12:04:43

Tags: java apache-spark

I am trying to get the IPs out of an access log file. I tried using Pattern, but I am not getting the correct output.

public class IPcount {
    public static void main(String[] args) {

    String IPADDRESS_PATTERN = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(t);

    JavaSparkContext sc = new JavaSparkContext("local", "IPcount");
    @SuppressWarnings({ "unused", "serial" })
    JavaRDD<String> lines = sc.textFile("/home/bhaumik/Documents/access_log", 5)
            .flatMap(new FlatMapFunction<String, String>() {

                @Override
                public Iterable<String> call(String t) throws Exception {
                    // TODO Auto-generated method stub
                    return null; //HERE WHAT SHOULD I DO SO THAT I CAN GET IP FILTER FROM THE LOG FILE.
                }
            });
    }
}

1 answer:

Answer 0 (score: 1)

Here is a Java way to extract the IPs from the RDD of log lines, assuming each line may contain zero, one, or more IPs:

JavaRDD<String>
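
The per-line extraction the answer describes (zero, one, or more IPs per line) can be sketched without Spark as a plain helper. This is a minimal sketch, not the answerer's code: the class name `IpExtractor`, the method `extractIps`, and the sample log line are hypothetical names introduced for illustration; the regular expression is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
public class IpExtractor {

    // IPv4 pattern taken from the question, compiled once and reused for every line.
    private static final Pattern IP_PATTERN = Pattern.compile(
            "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}"
            + "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)");

    // Returns every IPv4-looking substring of the line, in order of appearance.
    // The list is empty when the line contains no match.
    public static List<String> extractIps(String line) {
        List<String> ips = new ArrayList<>();
        Matcher matcher = IP_PATTERN.matcher(line);
        while (matcher.find()) {
            ips.add(matcher.group());
        }
        return ips;
    }

    public static void main(String[] args) {
        // Made-up access log line, used only to exercise the helper.
        String sample = "127.0.0.1 - - [06/Feb/2016:12:04:43 +0000] \"GET / HTTP/1.1\" 200 512";
        System.out.println(extractIps(sample));
    }
}
```

Inside the question's `FlatMapFunction`, the `call` method could then return `extractIps(t)` instead of `null`. With the Spark 1.x Java API used in the question, `call` returns an `Iterable<String>`, so the list can be returned directly; from Spark 2.0 on, `call` returns an `Iterator<String>`, so `extractIps(t).iterator()` would be needed. Note that the matcher has to be created inside `call`, where the line `t` is in scope, rather than in `main` as in the question.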