Question

I am reading a very large (500 MB) file with Files.lines(...). It reads part of the file, but at some point it fails with java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1

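I think the file contains lines in different charsets. Is there a way to skip these broken lines? I know the returned stream is backed by a Reader, and with a Reader I know how to skip, but I don't know how to get the Reader from the stream so that I can set it up the way I like.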

    List<String> lines = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI()), Charset.forName("UTF-8"))) {
        stream
            .filter(s -> s.substring(0, 2).equalsIgnoreCase("aa"))
            .forEach(lines::add);
    } catch (final IOException e) {
        // catch
    }

Answer 1

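You cannot filter out lines with invalid characters after decoding, because the preconfigured decoder has already stopped with an exception by then. You have to configure a CharsetDecoder manually and tell it either to ignore invalid input or to replace that input with a special character.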

    CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.IGNORE);
    Path path = Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI());
    List<String> lines;
    try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
         BufferedReader br = new BufferedReader(r)) {
        lines = br.lines()
            .filter(s -> s.regionMatches(true, 0, "aa", 0, 2))
            .collect(Collectors.toList());
    }
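To see what IGNORE does on concrete input, here is a minimal self-contained sketch. The class name `IgnoreMalformedDemo`, the helper `readIgnoringMalformed`, and the temporary test file are hypothetical names introduced only for illustration; the byte 0xFF stands in for "some byte that is not valid UTF-8".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

class IgnoreMalformedDemo {

    // Hypothetical helper: returns the lines starting with "aa" (case-insensitive),
    // silently dropping any bytes that are not valid UTF-8.
    static List<String> readIgnoringMalformed(Path path) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.IGNORE);
        // Closing the reader also closes the underlying channel.
        try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
             BufferedReader br = new BufferedReader(r)) {
            return br.lines()
                .filter(s -> s.regionMatches(true, 0, "aa", 0, 2))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bigtest", ".txt");
        try {
            // Three lines; the second contains 0xFF, which never occurs in valid UTF-8.
            byte[] data = {
                'a', 'a', ' ', 'o', 'k', '\n',
                'a', 'a', ' ', 'b', (byte) 0xFF, 'd', '\n',
                'b', 'b', '\n'
            };
            Files.write(tmp, data);
            System.out.println(readIgnoringMalformed(tmp)); // prints [aa ok, aa bd]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The damaged line survives as "aa bd": only the invalid byte is dropped, not the line that contained it.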

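This simply ignores charset decoding errors and skips the offending characters. To skip entire lines that contain errors, you can let the decoder insert a replacement character for each error (the default is '\ufffd') and then filter out the lines containing that character: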

    CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE);
    Path path = Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI());
    List<String> lines;
    try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
         BufferedReader br = new BufferedReader(r)) {
        lines = br.lines()
            .filter(s -> !s.contains(dec.replacement()))
            .filter(s -> s.regionMatches(true, 0, "aa", 0, 2))
            .collect(Collectors.toList());
    }
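The REPLACE variant can be checked the same way. The following is a minimal sketch under the same assumptions: `SkipMalformedLinesDemo`, `readSkippingMalformedLines`, and the temporary file are hypothetical, and 0xFF again plays the role of an invalid byte.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

class SkipMalformedLinesDemo {

    // Hypothetical helper: returns the lines starting with "aa" (case-insensitive),
    // skipping every line in which the decoder had to substitute a replacement.
    static List<String> readSkippingMalformedLines(Path path) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE);
        String marker = dec.replacement(); // "\uFFFD" unless configured otherwise
        try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
             BufferedReader br = new BufferedReader(r)) {
            return br.lines()
                .filter(s -> !s.contains(marker))                  // drop damaged lines
                .filter(s -> s.regionMatches(true, 0, "aa", 0, 2)) // keep "aa" lines
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bigtest", ".txt");
        try {
            // Three lines; the second contains 0xFF, which never occurs in valid UTF-8.
            byte[] data = {
                'a', 'a', ' ', 'o', 'k', '\n',
                'a', 'a', ' ', 'b', (byte) 0xFF, 'd', '\n',
                'b', 'b', '\n'
            };
            Files.write(tmp, data);
            System.out.println(readSkippingMalformedLines(tmp)); // prints [aa ok]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

One trade-off to keep in mind: a line that legitimately contains U+FFFD in the source file is indistinguishable from a damaged one and is dropped as well.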

Answer 2

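In this situation, a solution based on the Streams API becomes complex and error-prone. I suggest reading from a BufferedReader in a plain loop and catching MalformedInputException. This also lets you handle it separately from other IO exceptions: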

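    List<String> lines = new ArrayList<>();

    // Files.newBufferedReader reports malformed input by throwing MalformedInputException
    try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
        try {
            String line;
            while ((line = r.readLine()) != null) {
                // regionMatches does not throw on lines shorter than two characters
                if (line.regionMatches(true, 0, "aa", 0, 2)) {
                    lines.add(line);
                }
            }
        } catch (MalformedInputException mie) {
            // ignore or do something; reading stops at the first malformed input
        }
    }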
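For experimenting with the loop approach, here is a minimal self-contained sketch. `LoopReadDemo`, `readUntilMalformed`, and the temporary file are hypothetical names used only for illustration. Because the catch sits outside the loop, reading ends at the first decoding error. The reader also decodes in buffered chunks, so lines that precede the error may be missing from the result as well, particularly for small files; that is an observation about typical JDK behavior, not a documented guarantee.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LoopReadDemo {

    // Hypothetical helper: collects lines starting with "aa" (case-insensitive)
    // until the end of the file or the first decoding error, whichever comes first.
    static List<String> readUntilMalformed(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.regionMatches(true, 0, "aa", 0, 2)) {
                    lines.add(line);
                }
            }
        } catch (MalformedInputException mie) {
            // Only decoding errors are handled here; other IOExceptions propagate.
            System.err.println("Stopped at malformed input: " + mie.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bigtest", ".txt");
        try {
            // Three lines; the second contains 0xFF, which never occurs in valid UTF-8.
            byte[] data = {
                'a', 'a', ' ', 'o', 'k', '\n',
                'a', 'a', ' ', 'b', (byte) 0xFF, 'd', '\n',
                'b', 'b', '\n'
            };
            Files.write(tmp, data);
            // The result holds at most the lines decoded before the error.
            System.out.println(readUntilMalformed(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the goal is to keep reading past damaged lines rather than stop, the decoder-based approach from Answer 1 is the better fit.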

Files.lines to skip broken lines in Java 8
