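I have a list of file names (nearly 400,000). I need to parse the contents of each file and find a given string pattern.
Can anyone suggest the best way to speed up the search? (I am currently able to process the contents in 90 seconds.)
Below is the code segment that needs to be optimized.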
import java.util.{ArrayList, LinkedHashMap}

/**
 * This method is called for each file in a list. The file is parsed char by char and
 * compared with the pattern using a prefix table (as used in the KMP algorithm).
 *
 * @param pattern
 * Pattern to be searched for.
 *
 * @param prefixTable
 * Prefix table built using the KMP algorithm.
 * Example: for a given pattern the resulting tables are { "ababaca" => 0012301, "abcdabca" => 00001231, "aababca" => 0101001, "aabaabaaa" => 010123452 }
 *
 * @param file
 * File that needs to be parsed to find the string pattern.
 *
 * @return
 * For the given file, a map from line number to the start positions (char locations) of all occurrences of the pattern within that line.
 *
 */
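def contains(pattern: Array[Char], prefixTable: Array[Int], file: String): LinkedHashMap[Integer, ArrayList[Integer]] = {
  // Maps a line number to the 1-based start columns of every occurrence on that line.
  // Assumes a non-empty pattern that does not contain '\n'.
  val returnValue = new LinkedHashMap[Integer, ArrayList[Integer]]()
  val source = scala.io.Source.fromFile(file, "iso-8859-1")
  val lines = try source.mkString finally source.close()
  var lineNumber = 1
  var i = 0 // index into the file contents
  var k = 0 // 1-based column of lines(i) within the current line
  var j = 0 // number of pattern characters matched so far
  while (i < lines.length()) {
    val c = lines(i)
    if (c == '\n') {
      // A match cannot span lines: start a new line and reset the match state
      lineNumber += 1; k = 0; j = 0
    } else {
      k += 1
      // On a mismatch, fall back through the prefix table instead of re-reading text
      while (j > 0 && c != pattern(j)) j = prefixTable(j - 1)
      if (c == pattern(j)) j += 1
      if (j == pattern.length) {
        // Append to this line's list so earlier matches on the line are not overwritten
        var charAt = returnValue.get(Integer.valueOf(lineNumber))
        if (charAt == null) {
          charAt = new ArrayList[Integer]()
          returnValue.put(Integer.valueOf(lineNumber), charAt)
        }
        charAt.add(Integer.valueOf(k - pattern.length + 1))
        // Continue from the longest border so overlapping matches are also found
        j = prefixTable(j - 1)
      }
    }
    i += 1
  }
  returnValue
}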
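For reference, the prefixTable argument described in the comment above can be computed with the standard KMP preprocessing step. The following is only a minimal sketch: the helper name buildPrefixTable is hypothetical and is not part of the code above, and it assumes the table stores, for each position, the length of the longest proper prefix that is also a suffix.

```scala
// Hypothetical helper (not in the original code): builds the KMP prefix table.
// table(i) = length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
def buildPrefixTable(pattern: Array[Char]): Array[Int] = {
  val table = new Array[Int](pattern.length)
  var k = 0 // length of the border currently being extended
  var i = 1
  while (i < pattern.length) {
    // Shrink the border until it can be extended by pattern(i), or is empty
    while (k > 0 && pattern(i) != pattern(k)) k = table(k - 1)
    if (pattern(i) == pattern(k)) k += 1
    table(i) = k
    i += 1
  }
  table
}
```

Called as buildPrefixTable("ababaca".toCharArray), this reproduces the table 0012301 given in the comment. Since the table depends only on the pattern, it can be built once and reused for all files.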
Answer 0 (score: 0)
Use this code:
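object HelloWorld {
  def main(args: Array[String]): Unit = {
    val name = """A""".r              // regular expression to search for
    val chaine = "BCDARFA"            // string to search in
    val res = name.findAllIn(chaine)  // iterator over all matches
    println("found? " + res.hasNext)
    println("1st place " + res.start) // 0-based index of the first match
  }
}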
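This gives you the position of the first occurrence of a regular expression in a string. I don't know whether it is faster than yours, but in any case it can simplify your code.
Edit: here is the final code: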
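object HelloWorld {
  def main(args: Array[String]): Unit = {
    val name = """A""".r              // regular expression to search for
    val chaine = "BCDARFA"            // string to search in
    val res = name.findAllIn(chaine)  // iterator over all matches
    println("found? " + res.hasNext)
    println("1st place " + res.start) // 0-based index of the first match
    // Walk over every match and print its 0-based start index
    for (elt <- res.matchData) {
      println("position : " + elt.start)
    }
  }
}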
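To get the same shape of result as the question's method (line number plus positions within that line), the regex approach could be applied line by line. This is only a sketch under assumptions: the helper name findInText is hypothetical, the pattern is treated as a literal string (hence Pattern.quote), and positions are reported as 1-based columns. Unlike KMP, findAllMatchIn reports non-overlapping matches only.

```scala
import java.util.regex.Pattern

// Hypothetical helper: maps 1-based line number -> 1-based start columns of each match.
// Lines without a match are left out of the result.
def findInText(text: String, literal: String): Map[Int, List[Int]] = {
  // Quote the literal so that regex metacharacters in it are matched verbatim
  val regex = Pattern.quote(literal).r
  text.split("\n", -1).zipWithIndex
    .map { case (line, idx) =>
      (idx + 1) -> regex.findAllMatchIn(line).map(_.start + 1).toList
    }
    .filter { case (_, starts) => starts.nonEmpty }
    .toMap
}
```

For example, findInText("BCDARFA\nxyz\nAA", "A") yields entries for lines 1 and 3 only. Whether this is faster than the hand-written loop would need to be measured on the actual files.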