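I am searching a very large text database file (5+ GB) line by line for a specific pattern and storing every positive hit in a hash map.
The lines I am looking for contain a so-called netblock IP range. They look like this: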
inetnum:(0.0.0.0 - 0.0.0.0)
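Each 0.0.0.0 can be any IP address. The first address is always the smaller one, since it is the start address of the range; the second, larger one is the end address of that range.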
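To make the "start is never larger than end" property concrete, here is a minimal sketch that converts a dotted-quad address to a number and compares the two ends. The class `IpRange` and its methods `toLong` and `isOrdered` are hypothetical names used only for illustration; they are not part of my actual code.

```java
// Hypothetical helper for illustration only.
final class IpRange {
    // Converts a dotted-quad IPv4 string such as "10.0.0.1" to its 32-bit value, held in a long.
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not a dotted-quad address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the accumulated value one byte to the left and append this octet.
            value = (value << 8) | octet;
        }
        return value;
    }

    // Returns true when the start address does not exceed the end address.
    static boolean isOrdered(String start, String end) {
        return toLong(start) <= toLong(end);
    }
}
```

Comparing the numeric values avoids the pitfalls of comparing the strings lexicographically, where "9.0.0.0" would sort after "10.0.0.0".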
Currently, I find the matching lines with a regular expression, like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex pattern that matches an IP range of the form 0.0.0.0 - 0.0.0.0, where each 0 can be any integer from 0 to 255
String IPADDRESS_PATTERN = "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
        + " - "
        + "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
boolean isMatch;
String line;
BufferedReader bReader = new BufferedReader(new FileReader("Foo.txt")); // Foo.txt is a database text file.
String[] ips;
Scanner keys; // takes input from keyboard
Map<String, String> sourceMap = new HashMap<>(); // stores the start ip as the key for each netblock data structure.
// read each line and use the regex pattern to determine if the line has an ip range
while ((line = bReader.readLine()) != null) {
    Matcher matcher = pattern.matcher(line);
    isMatch = matcher.find(); // I tried doing this without using a boolean, but it didn't work for some reason... not sure why.
    if (isMatch) {
        // split only the matched range (not the whole line) into the start and end address, dropping the " - " separator.
        ips = matcher.group().split(" - ");
        sourceMap.put(ips[0], ips[1]); // insert the start ip address as the key and the end ip address as the associated value.
    }
}
bReader.close();
if (sourceMap.isEmpty()) {
    System.out.println("Error: Unable to read any usable data in file.");
    System.exit(1);
}
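The problem with this code is that it seems to run slower than my original naive approach, which checked the length of each line and whether the line contains the word "inetnum". It looked like this:
if (line.length() >= 27 && line.length() <= 43 && line.contains("inetnum")) {
    //foo
}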
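This naive approach seems to run faster than matching the pattern with a regex, but I am worried about its accuracy. Is one better than the other? Or is there an alternative solution I have not thought of?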
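For illustration, one possible middle ground is to run a cheap string check first and apply the regex only to lines that pass it. The sketch below assumes that every relevant line starts with "inetnum", as in the sample line above. The class `InetnumParser` and the method `parse` are hypothetical names, and I have not measured whether this is actually faster on the real file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class InetnumParser {
    // Same octet alternation as the original pattern, factored out for readability.
    private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    private static final String IP = "(?:" + OCTET + "\\.){3}" + OCTET;
    // Capturing groups 1 and 2 hold the start and end address.
    private static final Pattern RANGE = Pattern.compile("(" + IP + ") - (" + IP + ")");

    // Returns {start, end} for an inetnum line containing an IP range, or null otherwise.
    static String[] parse(String line) {
        // Cheap pre-filter (assumption: relevant lines begin with "inetnum").
        if (!line.startsWith("inetnum")) {
            return null;
        }
        // Only the few lines that pass the pre-filter pay for the regex.
        Matcher m = RANGE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

With this shape, the read loop would call `InetnumParser.parse(line)` and put the two returned strings into the map whenever the result is not null. The capturing groups also remove the need to split the line on " - ".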