Best way to read file contents and find a pattern across a given list of files

Asked: 2015-08-21 12:32:08

Tags: scala scala-collections

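I have a list of file names (nearly 400,000). I need to parse the contents of each file and find a given string pattern.

Can anyone suggest the best way to speed up the search? (Currently I can process the content in 90 seconds.)

Below is the code segment that needs to be optimized.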

/**
 * Called once for each file in the list. The file is scanned character by character
 * and compared with the pattern using the prefix table from the KMP algorithm.
 *
 * @param pattern
 *     Pattern to search for.
 *
 * @param prefixTable
 *     Prefix table built with the KMP algorithm.
 *     Example pattern => table pairs: { "ababaca" => 0012301, "abcdabca" => 00001231, "aababca" => 0101001, "aabaabaaa" => 010123452 }
 *
 * @param file
 *     File to parse for the string pattern.
 *
 * @return
 *     For the given file, a map from line number to the start positions (character
 *     columns) of every occurrence of the pattern within that line.
 */



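import java.util.{ArrayList, LinkedHashMap}

def contains(pattern: Array[Char], prefixTable: Array[Int], file: String): LinkedHashMap[Integer, ArrayList[Integer]] = {
  // Maps a 1-based line number to the 1-based start columns of the matches on that line
  val returnValue = new LinkedHashMap[Integer, ArrayList[Integer]]()
  if (pattern.isEmpty) return returnValue

  val source = scala.io.Source.fromFile(file, "iso-8859-1")
  val text = try source.mkString finally source.close()

  var lineNumber = 1
  var col = 0 // 1-based column of the character being examined
  var j = 0   // number of pattern characters matched so far
  var i = 0
  while (i < text.length) {
    val c = text(i)
    if (c == '\n') { lineNumber += 1; col = 0; j = 0 }
    else {
      col += 1
      // On a mismatch, fall back through the prefix table instead of rescanning
      while (j > 0 && c != pattern(j)) j = prefixTable(j - 1)
      if (c == pattern(j)) j += 1
      if (j == pattern.length) {
        // Append to the list for this line so earlier matches are not overwritten
        var starts = returnValue.get(Integer.valueOf(lineNumber))
        if (starts == null) {
          starts = new ArrayList[Integer]()
          returnValue.put(Integer.valueOf(lineNumber), starts)
        }
        starts.add(Integer.valueOf(col - pattern.length + 1))
        j = 0 // restart after a match, so overlapping matches are not reported
      }
    }
    i += 1
  }
  returnValue
}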
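The doc comment above says the prefix table is built with the KMP algorithm but does not show how. The following is a minimal sketch of one common way to build it, not necessarily the code used in the original post. The names `PrefixTableSketch` and `buildPrefixTable` are hypothetical. It reproduces the four example tables listed in the comment.

```scala
object PrefixTableSketch {
  // Hypothetical helper (not from the original post): table(i) is the length of the
  // longest proper prefix of pattern(0..i) that is also a suffix of pattern(0..i).
  def buildPrefixTable(pattern: Array[Char]): Array[Int] = {
    val table = new Array[Int](pattern.length)
    var k = 0 // length of the prefix matched so far
    var i = 1
    while (i < pattern.length) {
      // Shrink the matched prefix until it can be extended by pattern(i)
      while (k > 0 && pattern(i) != pattern(k)) k = table(k - 1)
      if (pattern(i) == pattern(k)) k += 1
      table(i) = k
      i += 1
    }
    table
  }

  def main(args: Array[String]): Unit = {
    // The example patterns from the doc comment
    for (p <- Seq("ababaca", "abcdabca", "aababca", "aabaabaaa"))
      println(p + " => " + buildPrefixTable(p.toCharArray).mkString)
  }
}
```

Because the table depends only on the pattern, it can be built once and reused for every file in the list.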

1 answer:

Answer 0 (score: 0)

Use this code:

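object HelloWorld {
  def main(args: Array[String]): Unit = {

    val name = """A""".r
    val chaine = "BCDARFA"

    // findAllIn returns an iterator over the matches
    val res = name.findAllIn(chaine)
    println("found? " + res.hasNext)

    // start is the 0-based index of the first match
    println("1st place " + res.start)

  }
}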

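This finds the first occurrence of a regular expression in a string. I don't know whether it is faster than your version, but in any case it can simplify your code.

Edit: here is the final code: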

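object HelloWorld {
  def main(args: Array[String]): Unit = {

    val name = """A""".r
    val chaine = "BCDARFA"

    // findAllIn returns an iterator over the matches
    val res = name.findAllIn(chaine)
    println("found? " + res.hasNext)

    // start is the 0-based index of the first match
    println("1st place " + res.start)

    // matchData exposes each match with its 0-based start position
    for (elt <- res.matchData) {
      println("position : " + elt.start)
    }

  }
}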
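To connect this answer to the question's requirement (line numbers plus start positions per line), here is a rough sketch that applies the same regex approach line by line. The names `LineMatchSketch` and `matchPositions` are hypothetical. It assumes the pattern is a literal string, hence `Pattern.quote`, and it reports 1-based columns. Whether this is faster than the KMP version is untested.

```scala
import java.util.regex.Pattern
import scala.collection.mutable

object LineMatchSketch {
  // Hypothetical helper: maps each 1-based line number to the 1-based start
  // columns of every non-overlapping occurrence of `literal` on that line.
  def matchPositions(lines: Iterator[String], literal: String): mutable.LinkedHashMap[Int, List[Int]] = {
    val regex = Pattern.quote(literal).r // quote so the text is matched literally
    val result = mutable.LinkedHashMap.empty[Int, List[Int]]
    for ((line, idx) <- lines.zipWithIndex) {
      val starts = regex.findAllMatchIn(line).map(_.start + 1).toList
      if (starts.nonEmpty) result(idx + 1) = starts // only keep lines with matches
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val sample = Iterator("BCDARFA", "nothing here", "xAx")
    for ((line, cols) <- matchPositions(sample, "A"))
      println("line " + line + " : " + cols.mkString(", "))
  }
}
```

For a real file, the lines could come from `scala.io.Source.fromFile(file, "iso-8859-1").getLines()`, with the source closed afterwards as in the question's code.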