Question

I'm reading files with Java 8 and using the following method to find matches for keys in each file:

private static List<String> listFilesWithMatches(String[] listOfIncludedFiles, Map<String, String> myPropMapKeys) {

    List<String> mapKeyList = new ArrayList<String>(myPropMapKeys.keySet());
    List<String> matchFileList = new ArrayList<>();
    Predicate<String> p = (str) -> mapKeyList.stream().anyMatch(key -> str.contains(utf8AsLatin1(key)));



    for(String myFile : listOfIncludedFiles){
         try (Stream<String> stream = Files.lines(Paths.get(myFile))) {
             boolean foundAKey = stream.anyMatch(p);
                if(foundAKey) {
                    matchFileList.add(myFile);

                    //Listing the files that have match
                    System.out.println("**"+ROOT_PATH + File.separator + myFile);
                }
         }

         catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    System.out.println("Total Number of files with matches:: "+matchFileList.size());
    return matchFileList;
}

private static String utf8AsLatin1(String key) {
     return new String(key.getBytes(StandardCharsets.ISO_8859_1),
             StandardCharsets.UTF_8);
}

Now I get the following error:

Exception in thread "main" java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
    at java.io.BufferedReader$1.hasNext(Unknown Source)
    at java.util.Spliterators$IteratorSpliterator.tryAdvance(Unknown Source)
    at java.util.stream.ReferencePipeline.forEachWithCancel(Unknown Source)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(Unknown Source)
    at java.util.stream.AbstractPipeline.copyInto(Unknown Source)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(Unknown Source)

The lines that cause the problem are:

try (Stream<String> stream = Files.lines(Paths.get(myFile))) {
                 boolean foundAKey = stream.anyMatch(p);

One solution is to use the ISO_8859_1 charset, but I only want the file's default encoding and don't want to use any other charset. Can anyone help with this?

Answer 1

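I'm glad you like UTF-8. But basically the error says that the file is not in UTF-8 (Files.lines(Path) decodes as UTF-8 when no charset is given). So log the error together with the file path and continue. Repair the file manually, then resubmit it.

This might not be feasible.

Or use the never-failing single-byte encoding ISO-8859-1, or better, the platform default encoding: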
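The "log and continue" idea could look roughly like the sketch below. The class and method names are made up for illustration. It relies on the fact that the stream returned by Files.lines reports decoding errors lazily as UncheckedIOException, which is what the stack trace in the question shows.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;
import java.util.stream.Stream;

class SkipUndecodableFiles {

    // Hypothetical helper: true if any line matches. A file that cannot be
    // decoded as UTF-8 is logged with its path and treated as "no match".
    static boolean fileMatches(Path path, Predicate<String> matcher) {
        try (Stream<String> lines = Files.lines(path)) { // decodes as UTF-8
            return lines.anyMatch(matcher);
        } catch (UncheckedIOException e) {
            // Thrown while the stream is consumed, e.g. wrapping a
            // MalformedInputException for bytes that are not valid UTF-8.
            System.err.println("Cannot decode as UTF-8, skipping: " + path
                    + " (" + e.getCause() + ")");
            return false;
        } catch (IOException e) {
            // Thrown when the file cannot be opened at all.
            System.err.println("Cannot open, skipping: " + path + " (" + e + ")");
            return false;
        }
    }
}
```

In the loop from the question, the body of the try block would then shrink to a call such as fileMatches(Paths.get(myFile), p), and the skipped files end up in the error log for manual repair.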

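Files.lines(Paths.get(myFile), StandardCharsets.ISO_8859_1)

// The key's UTF-8 bytes as they look when the file is decoded as Latin-1.
static String utf8AsLatin1(String key) {
    return new String(key.getBytes(StandardCharsets.UTF_8),
                      StandardCharsets.ISO_8859_1);
}

// String has no containsIgnoreCase, so compare lower-cased copies.
Predicate<String> p = (str) -> {
    String line = str.toLowerCase(Locale.ROOT);
    return mapKeyList.stream().anyMatch(key ->
        line.contains(utf8AsLatin1(key).toLowerCase(Locale.ROOT))
        || line.contains(key.toLowerCase(Locale.ROOT)));
};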

In theory, using a CharsetDecoder with explicit error handling would be more appropriate. However, the keys can contain special characters.
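As a rough sketch of what decoder-based error handling could look like (the class and method names are invented for illustration), the file is still read as UTF-8, but malformed bytes are replaced with U+FFFD instead of raising MalformedInputException:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

class LenientUtf8Lines {

    // Hypothetical helper: decodes as UTF-8 and substitutes U+FFFD for
    // any malformed or unmappable input instead of throwing.
    static boolean anyLineMatches(Path path, Predicate<String> matcher)
            throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder))) {
            return reader.lines().anyMatch(matcher);
        }
    }
}
```

This shows why special characters in keys are the catch: a non-ASCII character that was stored in another encoding arrives as U+FFFD, so a key containing it will not match, even though pure-ASCII keys on the same line still do.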

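The code above tries the ISO-8859-1 encoding. Even for European languages written in Latin script, that is not sufficient.

One could convert each key into a regular-expression pattern in which every special character is replaced by the wildcard sequence .{1,6}, and then do a regex match.

Text normalization is another issue: use java.text.Normalizer on correctly decoded text. à can be a single Unicode symbol (code point) or two symbols: a followed by a zero-width combining diacritical mark (`).

For searching, you can decompose the text and strip the diacritical marks. Some problems remain, such as the Polish stroked l (ł) and the Turkish dotted capital I (İ) and dotless i (ı).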
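A minimal sketch of that key-to-regex conversion, assuming "special" means any non-ASCII character. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class KeyPatterns {

    // Assumption: every non-ASCII code point counts as "special" and is
    // replaced by .{1,6}; ASCII runs are quoted so they match literally.
    static Pattern toLenientPattern(String key) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < key.length(); ) {
            int cp = key.codePointAt(i);
            if (cp < 0x80) {
                literal.appendCodePoint(cp);
            } else {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".{1,6}");
            }
            i += Character.charCount(cp);
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }
}
```

A line would then be tested with toLenientPattern(key).matcher(line).find(). The wildcard is deliberately loose, so it also accepts unrelated text of up to six characters in that position. Compiling each pattern once per key rather than once per line keeps the cost down.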
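Decomposing and stripping marks could be sketched as follows. The helper name is made up, and it only makes sense on text that was decoded with the right charset in the first place.

```java
import java.text.Normalizer;

class SearchFolding {

    // Decomposes to NFD (base letter + combining marks) and then removes
    // all characters in the Unicode "Mark" category.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applying stripDiacritics to both the line and the key makes the precomposed and the decomposed form of à compare equal. Letters such as ł and ı have no canonical decomposition, so they pass through unchanged, which is the remaining problem mentioned above.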

A more sensible solution

for(String myFile : listOfIncludedFiles){
     Path path = Paths.get(myFile);
     try (Stream<String> stream = Files.lines(path,
             determineCharset(path))) {


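// Files.readAllBytes can fail, so the IOException is passed on to the caller.
static Charset determineCharset(Path path) throws IOException {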
    byte[] bytes = Files.readAllBytes(path);
    for (int i = 0; i < bytes.length; ++i) {
        byte b = bytes[i];
        if (b == 0) {
            return i % 2 == 0
                    ? StandardCharsets.UTF_16
                    : StandardCharsets.UTF_16LE;
        }
        if (b < 0) {
            int high1s= 0; // Length of byte sequence
            while ((b & 0x80) == 0x80) {
                ++high1s;
                b = (byte)(b << 1);
            } 
            if (high1s == 1 || i + high1s > bytes.length) {
                // A UTF-8 continuation byte
                // cannot be at the start.
                // Or not sufficient room for
                // continuation bytes
                return Charset.defaultCharset()
                    .equals(StandardCharsets.UTF_8)
                    ? StandardCharsets.ISO_8859_1
                    : Charset.defaultCharset();
            }
            int contBytes = high1s - 1;
            // Skip continuation bytes
            while (i + 1 < bytes.length
                    && (bytes[i+1] & 0b1100_0000)
                       == 0b1000_0000) {
                 ++i;
                 --contBytes;
            }
            if (contBytes != 0) {
                return Charset.defaultCharset()
                    .equals(StandardCharsets.UTF_8)
                    ? StandardCharsets.ISO_8859_1
                    : Charset.defaultCharset();
            }
        }
    }
    return StandardCharsets.UTF_8;
}

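This checks for UTF-8 conformity (and could certainly be written better). When the content is not valid UTF-8, it returns the platform encoding; when the platform encoding is itself UTF-8, it falls back to Latin-1.

This is a very limited solution.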
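One possible way to write the UTF-8 conformity check more compactly is to let a strict CharsetDecoder do the validation. This is only a sketch with invented names, and it leaves out the zero-byte UTF-16 heuristic from the method above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8OrFallback {

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Same fallback rule as described above: UTF-8 if valid, otherwise the
    // platform charset, or Latin-1 when the platform charset is UTF-8.
    static Charset chooseCharset(Path path) throws IOException {
        if (isValidUtf8(Files.readAllBytes(path))) {
            return StandardCharsets.UTF_8;
        }
        Charset platform = Charset.defaultCharset();
        return platform.equals(StandardCharsets.UTF_8)
                ? StandardCharsets.ISO_8859_1
                : platform;
    }
}
```

Like the hand-written loop, this reads the whole file into memory before the real pass, so it is only reasonable for files of modest size.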

Java 8 - java.nio.charset.MalformedInputException
