Question

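I want to count the occurrences of the pattern "$$$$" in a text file. The method below works for some files but not for all of them. For example, it fails on the following file (http://www.hmdb.ca/downloads/structures.zip, a zipped text file with the .sdf extension), and I cannot figure out why. I also tried escaping whitespace, with no luck. It returns 11 when there are more than 35,000 "$$$$" patterns. Note that speed is critical, so I cannot use any slower method.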

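import java.io.File;
import java.util.Scanner;
import java.util.regex.Pattern;

public static void countMoleculesInSDF(String fileName)
{
    int tot = 0;
    Scanner scan = null;
    Pattern pat = Pattern.compile("\\$\\$\\$\\$");

    try {
        File file = new File(fileName);
        // No charset is given here, so the platform default charset is used.
        scan = new Scanner(file);
        long start = System.nanoTime();
        // A horizon of 0 means the search is not bounded.
        while (scan.findWithinHorizon(pat, 0) != null) {
            tot++;
        }
        long dur = (System.nanoTime() - start) / 1000000;
        System.out.println("Results found: " + tot + " in " + dur + " msecs");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // The constructor may have thrown, in which case scan is still null.
        if (scan != null) {
            scan.close();
        }
    }
}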

Answer 1

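With the file you linked and the code you posted, I consistently get 218 matches. That is of course wrong: checking with Notepad++'s count function, the file should contain 41498 matches. So I figured something was wrong with the Scanner and started debugging right after the last match, that is, at the point where the Scanner reports that no matches are left. Doing so, I found an exception in its private method readInput() that is not thrown directly but saved in a private field: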

try {
    n = source.read(buf);
} catch (IOException ioe) {
    lastException = ioe;
    n = -1;
}

This exception can be retrieved with the Scanner#ioException() method:

IOException ioException = scanner.ioException();
if (ioException != null) {
    ioException.printStackTrace();
}

Printing this exception shows that some input could not be decoded:

java.nio.charset.UnmappableCharacterException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:278)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
    at java.io.Reader.read(Reader.java:100)
    at java.util.Scanner.readInput(Scanner.java:849)
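
To see this kind of decoding failure in isolation, here is a minimal sketch that is not taken from the original thread. The class name DecodeFailureDemo and the sample bytes are made up for illustration; the sketch only relies on the documented behavior that a fresh CharsetDecoder reports malformed and unmappable input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeFailureDemo {
    /** Returns true if the bytes decode without error in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            // A new decoder reports malformed and unmappable input by default,
            // so decode() throws instead of substituting a replacement character.
            charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: the byte 0xE9 is valid ISO-8859-1 but not valid UTF-8 here.
        byte[] latin1 = "caf\u00e9 $$$$".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1)); // true
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));      // false
    }
}
```

The same bytes are fine in one charset and undecodable in another, which is the situation the stack trace above points to.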

So I simply tried passing a charset to the Scanner's constructor:

scan = new Scanner(file, "utf-8");

And that made it work!

Results found: 41498 in 2431 msecs

So the problem was that the Scanner used the platform's default charset, which was not suitable for fully decoding your file.

Moral of the story:

Always pass the charset explicitly when working with text.
Check for an IOException when using Scanner.
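
Both points can be combined in one helper. The following is only a sketch under assumptions: the names CheckedScannerCount, countLiteral and writeSample are invented for this example, and the sample file is a tiny stand-in for a real SD file. It passes the charset explicitly and turns a swallowed IOException into a visible failure instead of returning a count that is too low.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;
import java.util.regex.Pattern;

class CheckedScannerCount {
    /** Counts literal occurrences of `literal`, decoding the file with the named charset. */
    static int countLiteral(File file, String charsetName, String literal) {
        Pattern pat = Pattern.compile(Pattern.quote(literal));
        int total = 0;
        try (Scanner scan = new Scanner(file, charsetName)) {
            while (scan.findWithinHorizon(pat, 0) != null) {
                total++;
            }
            // Scanner stores read errors instead of throwing them, so ask for them.
            IOException failure = scan.ioException();
            if (failure != null) {
                throw new UncheckedIOException(failure);
            }
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Hypothetical helper: writes sample bytes to a temporary file. */
    static File writeSample(byte[] content) {
        try {
            File f = File.createTempFile("sample", ".sdf");
            f.deleteOnExit();
            Files.write(f.toPath(), content);
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "mol A\n$$$$\nmol B\n$$$$\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(countLiteral(writeSample(sample), "ISO-8859-1", "$$$$")); // prints 2
    }
}
```

Throwing here is a design choice for the sketch; logging the exception and returning the partial count would also work, as long as the failure is not ignored.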

PS: some handy ways to quote a string for use as a regex:

Pattern pat = Pattern.compile("\\Q$$$$\\E");

or

Pattern pat = Pattern.compile(Pattern.quote("$$$$"));
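
As a quick check that the quoted forms behave like the manually escaped pattern from the question, here is a small sketch. The class name QuoteDemo and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    /** Counts non-overlapping matches of the pattern in the text. */
    static int count(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical sample resembling two SD records.
        String text = "M  END\n$$$$\nM  END\n$$$$\n";
        System.out.println(count(Pattern.compile("\\$\\$\\$\\$"), text));          // 2
        System.out.println(count(Pattern.compile("\\Q$$$$\\E"), text));            // 2
        System.out.println(count(Pattern.compile(Pattern.quote("$$$$")), text));   // 2
        // Pattern.quote wraps the literal in \Q...\E when it contains no \E itself.
        System.out.println(Pattern.quote("$$$$").equals("\\Q$$$$\\E"));            // true
    }
}
```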

Answer 2

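This is what I ended up doing (before you posted your answer). This approach seems to be faster than Scanner. Which implementation would you suggest: Scanner or memory mapping? Would memory mapping fail for large files? I am not sure.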

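import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Charset CHARSET = Charset.forName("ISO-8859-15");
// Note: a CharsetDecoder is not safe for use by multiple concurrent threads.
private static final CharsetDecoder DECODER = CHARSET.newDecoder();

public static int getNoOfMoleculesInSDF(String fileName)
{
    int total = 0;
    Pattern endOfMoleculePattern = Pattern.compile("\\$\\$\\$\\$");
    // try-with-resources closes the stream and the channel even on failure.
    try (FileInputStream fis = new FileInputStream(fileName);
         FileChannel fc = fis.getChannel())
    {
        // Pass the size as a long: map() rejects regions larger than
        // Integer.MAX_VALUE bytes instead of silently truncating the size.
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        CharBuffer cb = DECODER.decode(mbb);
        Matcher matcher = endOfMoleculePattern.matcher(cb);
        while (matcher.find()) {
            total++;
        }
    }
    catch (Exception e)
    {
        // Log the exception itself so the cause is not lost.
        LOGGER.error("An error occurred while counting molecules in the SD file", e);
    }
    return total;
}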
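
On the large-file question: a single FileChannel.map call cannot map more than Integer.MAX_VALUE bytes, and decoding the whole file into a CharBuffer needs memory proportional to the file size. One possible alternative, which is not from the original thread, is to stream the raw bytes and count runs of four '$' bytes without decoding at all. The sketch below assumes an ASCII-compatible encoding such as ISO-8859-15 or UTF-8, where '$' is always the single byte 0x24; the class and method names are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamingDollarCount {
    /** Counts non-overlapping runs of four consecutive '$' bytes in the stream. */
    static long countDelimiters(InputStream in) {
        long total = 0;
        int run = 0; // length of the current run of '$' bytes
        byte[] buf = new byte[1 << 16];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '$') {
                        if (++run == 4) {
                            total++;
                            run = 0; // restart so matches do not overlap, like Matcher.find()
                        }
                    } else {
                        run = 0;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical sample; for a real file, pass a FileInputStream instead.
        byte[] sample = "mol A\n$$$$\nmol B\n$$$$\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countDelimiters(new ByteArrayInputStream(sample))); // prints 2
    }
}
```

Because the run counter is kept across reads, a delimiter split over two buffer fills is still counted. Whether this is faster than the memory-mapped version would have to be measured on the actual files.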

findWithinHorizon fails to match
