Is it better to use regular expressions or another search technique on a large data set?

Asked: 2018-06-22 18:25:37

Tags: java algorithm hashmap bigdata

I am searching a very large text database file (5+ GB) line by line for a specific pattern and storing every positive hit in a HashMap.

The lines I am looking for contain a so-called netblock IP range. They look like this:

inetnum:(0.0.0.0 - 0.0.0.0)

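Each 0.0.0.0 stands for any IP address. The first address is always the smaller one, since it is the start of the range; the second, larger address is the end of that range.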
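To make the start/end relationship concrete, here is a minimal sketch that converts a dotted-quad address to a number so the two ends of a range can be compared. The names `IpRange`, `toLong` and `isValidRange` are hypothetical, chosen only for illustration, and the sketch assumes IPv4 input.

```java
class IpRange {
    // Converts a dotted-quad IPv4 string such as "10.0.0.1" to its numeric value.
    // Inputs with the wrong number of octets or out-of-range values are rejected
    // with an IllegalArgumentException.
    static long toLong(String ip) {
        // The -1 limit keeps trailing empty strings, so "1.2.3.4." is rejected too.
        String[] parts = ip.split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part); // NumberFormatException on non-numeric text
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True when the start address is not greater than the end address.
    static boolean isValidRange(String start, String end) {
        return toLong(start) <= toLong(end);
    }
}
```

With this, `IpRange.isValidRange("10.0.0.0", "10.0.0.255")` holds, while swapping the two arguments does not.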

At the moment I search for matching lines with a regular expression, like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NetblockScanner {
        // One octet: an integer from 0 to 255.
        private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
        private static final String IP = "(?:" + OCTET + "\\.){3}" + OCTET;
        // Matches an IP range of the form "0.0.0.0 - 0.0.0.0".
        // Group 1 is the start address, group 2 is the end address.
        private static final Pattern RANGE = Pattern.compile("(" + IP + ") - (" + IP + ")");

        public static void main(String[] args) throws IOException {
            // Stores the start IP as the key and the end IP as the value for each netblock.
            Map<String, String> sourceMap = new HashMap<>();

            // Foo.txt is the database text file; try-with-resources closes the reader.
            try (BufferedReader bReader = new BufferedReader(new FileReader("Foo.txt"))) {
                String line;
                // Read each line and use the regex to determine whether it contains an IP range.
                while ((line = bReader.readLine()) != null) {
                    Matcher matcher = RANGE.matcher(line);
                    if (matcher.find()) {
                        // Take the addresses from the capture groups; line.split(" - ")
                        // would leave the "inetnum:" prefix in the key.
                        sourceMap.put(matcher.group(1), matcher.group(2));
                    }
                }
            }
            if (sourceMap.isEmpty()) {
                System.out.println("Error: Unable to read any usable data in file.");
                System.exit(1);
            }
        }
    }

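The problem with my code is that it seems to run slower than my original naive approach, which checked the length of each line and whether the line contains the word "inetnum". It looked like this: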

    if (line.length() >= 27 && line.length() <= 43 && line.contains("inetnum")) {
        //foo
    }

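This naive approach seems to run faster than matching with the regex, but I am worried about its accuracy. Is one better than the other? Or is there an alternative solution I haven't thought of?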
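One possible middle ground between the two approaches is to run a cheap string check first and apply the regex only to lines that pass it. The sketch below assumes that every line of interest starts with `inetnum:` (true of the example line above, but an assumption about the rest of the file). The names `InetnumFilter` and `parseRange` are hypothetical, and whether this is actually faster on a given file would need to be measured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InetnumFilter {
    // One octet: an integer from 0 to 255.
    private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    private static final String IP = "(?:" + OCTET + "\\.){3}" + OCTET;
    // Group 1 is the start address, group 2 is the end address.
    private static final Pattern RANGE = Pattern.compile("(" + IP + ") - (" + IP + ")");

    // Returns {start, end} for an inetnum line, or null if the line does not qualify.
    static String[] parseRange(String line) {
        // Cheap prefilter: lines that fail here never reach the regex.
        if (!line.startsWith("inetnum:")) {
            return null;
        }
        Matcher m = RANGE.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```

The prefilter keeps the speed of the plain string check for the bulk of the file, while the regex still guards accuracy on the candidate lines, so a line that merely mentions "inetnum" without a well-formed range is not stored.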

0 answers:

No answers yet.