Question

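I have been trying to build an in-memory string-processing application for an assignment. My idea was to load the whole string into memory and then parse it there.

To do that, I first wrote a byte-string parser that works like Scanner but uses a CharBuffer (the whole string is loaded into memory). It turned out to be no faster than a disk-based string parser.

Then I noticed that CharBuffer implements Readable, so I tried using a Scanner like this: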

FileChannel channel = new FileInputStream(file).getChannel();
MappedByteBuffer mapped_buffer =
             channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharBuffer buffer = decoder.decode(mapped_buffer);
Scanner sc = new Scanner(buffer).useDelimiter("\n");

But this is about as fast as the disk-based Scanner, or even slower. Sample code for the disk-based program:

File target = new File(target_path);

Scanner scan = new Scanner(target);
// hasNextLine() is the check that pairs with nextLine()
while (scan.hasNextLine()) {
    String line = scan.nextLine();
    ...
}

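Everyone says that in-memory processing is much faster than disk-based processing. What should I consider when parsing a string in memory in order to get that kind of performance? Is it reasonable to use Scanner to read in-memory string data? Or is there an in-memory scanner I can use to read parsed lines from memory?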

Answer 1

Why use Scanner at all? Scanner, CharsetDecoder and the like are all going to be slow.

Especially if all you are reading is ASCII, you don't need any of that.

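byte[] bytes = new byte[(int) file.length()];

// read() may return fewer bytes than requested, so loop until the array is full
FileInputStream in = new FileInputStream(file);
int off = 0;
while (off < bytes.length) {
    int n = in.read(bytes, off, bytes.length - off);
    if (n < 0) break; // file ended early
    off += n;
}
in.close();

// widen each byte to a char; no decoder needed for ASCII
char[] text = new char[bytes.length];
for (int i = 0; i < bytes.length; i++) {
    text[i] = (char) (bytes[i] & 0xFF);
}

for (String line : new String(text).split("\n")) {
    // process the line here
}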

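UTF-16 would just need one extra, slightly more complicated step.

Reading line by line is not that complicated either. I would still advise against using something like Scanner.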
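For input that is not plain ASCII, the byte-to-char cast above no longer gives correct characters. One possible way to handle that (a sketch, not necessarily the extra step the answer had in mind) is to let the `String(byte[], Charset)` constructor do the decoding. The class and method names here are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeExample {
    // Decodes raw bytes with an explicit charset, then splits on line feeds.
    // Hypothetical helper, not part of the original answer.
    static String[] linesOf(byte[] bytes, Charset charset) {
        return new String(bytes, charset).split("\n");
    }

    public static void main(String[] args) {
        // Stand-in for file contents that were read into memory as UTF-16LE
        byte[] utf16 = "first\nsecond".getBytes(StandardCharsets.UTF_16LE);
        for (String line : linesOf(utf16, StandardCharsets.UTF_16LE)) {
            System.out.println(line);
        }
    }
}
```

This goes through the JDK's decoder internally, so it trades some speed for correctness on non-ASCII data.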

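StringBuilder line = new StringBuilder(1024);
// buffer the stream; unbuffered single-byte read() calls are very slow
InputStream in = new BufferedInputStream(new FileInputStream(file));

int next;
boolean lb = true;

while ((next = in.read()) != -1) {
    if (next == 0xD || next == 0xA) {
        // skip if there are multiple line breaks
        if (lb) continue;

        lb = true;
        sendNextLineSomewhere(line.toString());

        // reuse the builder instead of creating a new one
        line.setLength(0);
    } else {
        lb = false;
        line.append((char) next);
    }
}

// send the last line if the file does not end with a line break
if (!lb) sendNextLineSomewhere(line.toString());

in.close();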

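One note about ASCII line breaks: there are two characters associated with them, line feed (0xA) and carriage return (0xD). Some text editors (for example Windows Notepad) register a line break only from the two-character CR + LF combination. That is just something to keep in mind. If you don't account for it and your file came from such a program, you will get empty lines. On the output side, if you don't write a CR + LF combination, a program that expects one will not read the file correctly.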
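As a rough illustration of that note, the sketch below treats CR + LF, a lone CR, and a lone LF each as a single line break when splitting, and writes CR + LF on output. The class and method names are hypothetical.

```java
public class LineBreaks {
    // "\r\n" is listed first so a CR + LF pair counts as one break, not two.
    static String[] splitLines(String text) {
        return text.split("\r\n|\r|\n");
    }

    // Joins lines with CR + LF for programs that expect that convention.
    static String toCrLf(String[] lines) {
        return String.join("\r\n", lines) + "\r\n";
    }

    public static void main(String[] args) {
        for (String line : splitLines("a\r\nb\nc\rd")) {
            System.out.println(line);
        }
    }
}
```

Unlike the loop in the answer, this keeps empty lines that sit between two non-empty ones; `String.split` only drops trailing empty strings.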

Answer 2

This is just my way of parsing a String sequentially:

  // FileUtils comes from Apache Commons IO
  byte[] msg = FileUtils.readFileToByteArray(file);
  ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(msg);

  InputSource saxInputSource = new InputSource(byteArrayInputStream);

  // getCharacterStream() returns null when the InputSource was built from a
  // byte stream, so decode the byte stream explicitly (US-ASCII, as in the question)
  Reader reader = new InputStreamReader(saxInputSource.getByteStream(), StandardCharsets.US_ASCII);

  StringBuffer segmentBuffer = new StringBuffer(512);

  // write a method for writing into the segmentBuffer
  int c = reader.read();

  while (c != -1) {
      segmentBuffer.append((char) c);
      // do something.. in your case break; if \n appears
      c = reader.read();
  }

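Sorry for the messy code, this is just from memory. I hope you get the idea: fill the StringBuffer with characters until you reach a line break. After that you can process the line, clear the StringBuffer, call the method again to refill it, and so on. Just ask if you need more explanation.

Edit: Well, I don't really know what you actually want to achieve, but the code below took 0.389 seconds on my machine to read a 100 MB file. It also buffers the content in memory as it reads.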
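The fill, process, clear cycle described above could look roughly like this. It is a sketch with made-up names (`readSegment`, `linesOf`), it uses StringBuilder rather than StringBuffer because no synchronization is needed here, and it reads from a StringReader so that it runs without a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SegmentReader {
    // Clears `segment`, then fills it up to (not including) the next '\n'.
    // Returns false once the reader has no more input.
    static boolean readSegment(Reader reader, StringBuilder segment) throws IOException {
        segment.setLength(0);
        int c = reader.read();
        if (c == -1) return false;
        while (c != -1 && c != '\n') {
            segment.append((char) c);
            c = reader.read();
        }
        return true;
    }

    // Calls readSegment repeatedly and collects each line.
    static List<String> linesOf(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder segment = new StringBuilder(512);
        try (Reader reader = new StringReader(text)) {
            while (readSegment(reader, segment)) {
                lines.add(segment.toString()); // "process the line" step
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(linesOf("one\ntwo\n\nthree"));
    }
}
```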

BufferedReader br = new BufferedReader(new FileReader(file));
for (String line; (line = br.readLine()) != null; ) {
    // process the line here
}
br.close();
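If the text is already in memory as a String, the same `readLine()` loop can run over a `StringReader` instead of a `FileReader`. This is a minimal sketch of that variation, with a hypothetical class and method name; it makes no claim about how it compares in speed to the file-based version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class InMemoryLines {
    // Reads lines from a String held in memory. readLine() treats
    // "\n", "\r" and "\r\n" each as one line terminator.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            for (String line; (line = br.readLine()) != null; ) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines("a\r\nb\nc"));
    }
}
```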

Is there an in-memory scanner similar to java.util.Scanner for reading a string line by line?