java.util.Scanner: reading files with different character encodings

Asked: 2018-11-06 12:12:15

Tags: java arrays character-encoding java.util.scanner

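I use java.util.Scanner to read a list of files. Some of them have a different encoding, UTF-8 instead of ANSI. Scanner cannot read these files and I get empty output strings. I tried another approach:

FileInputStream fis = new FileInputStream(my_file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
InputStreamReader isr = new InputStreamReader(fis);
isr.getEncoding();

For Scanner, I am not sure how to change the character encoding. UTF-8 and ANSI files are mixed in the same folder. I tried using Apache Tika for this. After detecting the encoding of a file, I use

Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();

but the output is empty.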

3 answers:

Answer 0 (score: 1)

There is a library called juniversalchardet that helps you guess the correct encoding. It was updated recently and is now hosted on GitHub:

https://github.com/albfernandez/juniversalchardet

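However, no tool can detect an encoding in a fail-safe way, because there are too many unknowns:

  1. Is the file text at all, only partly text, or binary data such as a PNG?
  2. Is it stored in a (1,...,k,...,n)-bit encoding?
  3. Which of the k-bit encodings was used?

You can make a rough guess by counting unusual control characters. If decoding a file yields many control symbols, you have probably picked the wrong encoding. (Then try the next one.)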
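The control-character heuristic described above can be sketched roughly as follows. This is only an illustration of the counting idea, not how juniversalchardet works internally; the class name `ControlCharHeuristic` and its methods are hypothetical, and ISO-8859-1 stands in for "ANSI" purely as an assumption. Malformed byte sequences are counted as well, since `new String(bytes, charset)` replaces them with U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ControlCharHeuristic {

    /** Counts characters that suggest the bytes were decoded with the wrong charset. */
    public static int suspiciousCount(byte[] data, Charset charset) {
        // new String(...) replaces malformed input with U+FFFD instead of throwing
        String text = new String(data, charset);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean ordinaryWhitespace = c == '\n' || c == '\r' || c == '\t';
            if (c == '\uFFFD' || (Character.isISOControl(c) && !ordinaryWhitespace)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the candidate with the fewest suspicious characters; earlier candidates win ties. */
    public static Charset guess(byte[] data, List<Charset> candidates) {
        Charset best = candidates.get(0);
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            int score = suspiciousCount(data, candidate);
            if (score < bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Sample bytes written in a legacy single-byte encoding (assumption: ISO-8859-1)
        byte[] legacy = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("Guessed charset: " + guess(legacy, candidates));
    }
}
```

Putting UTF-8 first in the candidate list matters: valid UTF-8 text often decodes without control characters in a single-byte charset too, so the tie has to be broken by order. That weakness is one reason a dedicated detector is usually the better choice.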

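juniversalchardet uses several other, far more successful, methods to determine the encoding (even for Chinese). It also provides convenience methods that open a reader for a file with the correct encoding already selected: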

(Excerpt from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding)

import org.mozilla.universalchardet.ReaderFactory;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main (String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }

        File file = new File(args[0]);
        // The variable must be a BufferedReader (not a plain Reader) for readLine() to compile;
        // try-with-resources closes it even if reading fails
        try (BufferedReader reader = ReaderFactory.createBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line to console
            }
        }
    }

}

Edit: added a ScannerFactory

/*
(C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;

/**
 * Create a scanner from a file with correct encoding
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }
    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset defaultCharset to use if can't be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */

    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must be not null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
    }
    /**
     * Create a scanner from a file with correct encoding. If charset cannot be determined,
     * it uses the system default charset.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}

Answer 1 (score: 0)

Your approach will not give you the correct encoding.

 FileInputStream fis = new FileInputStream(my_file);
 BufferedReader br = new BufferedReader(new InputStreamReader(fis));
 InputStreamReader isr = new InputStreamReader(fis);
 isr.getEncoding();

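This returns the encoding used by the InputStreamReader itself (read the javadoc), not the encoding of the characters written in the file (my_file in your case). And if the encoding is wrong, Scanner cannot read the file correctly.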
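A small sketch makes the point concrete: the same file is opened twice with different charsets, and `getEncoding()` reports whatever the reader was constructed with. The class name `GetEncodingDemo` and the temp-file setup are assumptions made for this illustration only.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class GetEncodingDemo {

    /** Opens the file with the given charset and returns what getEncoding() reports. */
    public static String readerEncoding(Path file, Charset charset) throws IOException {
        try (InputStreamReader isr = new InputStreamReader(Files.newInputStream(file), charset)) {
            // Reports the reader's own charset (as a historical name); the file is never inspected
            return isr.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        // The file content is UTF-8 in both cases
        Files.write(file, "h\u00e9llo".getBytes(StandardCharsets.UTF_8));

        System.out.println(readerEncoding(file, StandardCharsets.UTF_8));
        System.out.println(readerEncoding(file, StandardCharsets.ISO_8859_1));

        Files.delete(file);
    }
}
```

The two printed names differ even though the bytes on disk are identical, which is why `getEncoding()` cannot be used for detection. With the no-argument-charset constructor used in the question, the reported value is simply the platform default.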

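In fact, and correct me if I'm wrong, there is no way to determine the encoding of a given file with 100% accuracy. A few projects guess the encoding with a high success rate, but none is 100% accurate. On the other hand, if you know which encoding was used, you can read the file with:
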
Scanner scanner = new Scanner(my_file, "charset");
scanner.nextLine();

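Also, find out the correct charset name to use for "ANSI" in Java. "ANSI" is not a single encoding: on Windows it usually means the system code page, for example windows-1252 (Cp1252) on Western European systems or windows-1251 (Cp1251) on Cyrillic ones. US-ASCII only fits files that contain no bytes above 0x7F.

Whichever route you take, watch for any IOException; it may point you in the right direction.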
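One caveat when looking for that IOException: Scanner does not throw it. It swallows I/O errors from its source and exposes the last one through `ioException()`. As far as I know, the File-based constructors use a decoder that reports malformed input, so a wrong charset can make Scanner stop silently, which would match the empty output described in the question. The sketch below shows one way to surface the hidden error; the class name `ScannerErrorCheck`, the helper `readLines` and the sample bytes are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerErrorCheck {

    /**
     * Appends every line of the file to 'lines' and returns the IOException
     * that Scanner swallowed while reading, or null if there was none.
     */
    public static IOException readLines(Path file, String charsetName, List<String> lines)
            throws IOException {
        try (Scanner scanner = new Scanner(file.toFile(), charsetName)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
            // Scanner never rethrows read errors; it only remembers the last one
            return scanner.ioException();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("legacy", ".txt");
        // Assumption: a legacy single-byte file; 0xE9 on its own is not valid UTF-8
        Files.write(file, "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1));

        for (String charsetName : new String[] {"UTF-8", "ISO-8859-1"}) {
            List<String> lines = new ArrayList<>();
            IOException problem = readLines(file, charsetName, lines);
            System.out.println(charsetName + ": " + lines.size()
                    + " line(s), ioException = " + problem);
        }
        Files.delete(file);
    }
}
```

Checking `ioException()` after the read loop is cheap and turns an unexplained empty result into a concrete decoding error that names the problem.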

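Answer 2 (score: 0)

To make Scanner work with different encodings, you have to pass the correct encoding to the Scanner constructor.

To determine a file's encoding, it is best to use an external library (for example https://github.com/albfernandez/juniversalchardet). However, if you know for certain which encodings are possible, you can check manually, based on Wikipedia:
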
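import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;

public class ReadLines {

    public static void main(String... args) throws IOException {
        List<String> lines = readLinesFromFile(new File("d:/utf8.txt"));
    }

    public static List<String> readLinesFromFile(File file) throws IOException {
        try (Scanner scan = new Scanner(file, getCharsetName(file))) {
            List<String> lines = new LinkedList<>();

            // hasNextLine() is the check that matches nextLine()
            while (scan.hasNextLine())
                lines.add(scan.nextLine());

            return lines;
        }
    }

    private static String getCharsetName(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            // EF BB BF is the UTF-8 byte order mark (BOM)
            if (in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)
                return StandardCharsets.UTF_8.name();
            // No BOM: assume plain 7-bit ASCII; use the real legacy charset if the files contain other bytes
            return StandardCharsets.US_ASCII.name();
        }
    }
}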
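The same manual check can be extended to the UTF-16 byte order marks. The following is a minimal sketch under stated assumptions: the class name `BomSniffer` and the caller-supplied `fallback` parameter are hypothetical, `readNBytes` requires Java 9 or later, and files without a BOM (which includes many UTF-8 files) simply get the fallback.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomSniffer {

    /** Maps a leading byte order mark to a charset, or returns the fallback if none matches. */
    public static Charset charsetFromBom(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        // Limitation: FF FE 00 00 is the UTF-32LE BOM and is reported as UTF-16LE here
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    /** Reads at most three bytes from the file and inspects them for a BOM. */
    public static Charset charsetOf(Path file, Charset fallback) throws IOException {
        byte[] head = new byte[3];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            read = in.readNBytes(head, 0, head.length); // Java 9+
        }
        return charsetFromBom(Arrays.copyOf(head, read), fallback);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bom-demo", ".txt");
        Files.write(file, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
        System.out.println(charsetOf(file, StandardCharsets.ISO_8859_1));
        Files.delete(file);
    }
}
```

Taking the fallback as a parameter keeps the "no BOM" decision with the caller, who knows which legacy charset the folder is likely to contain. Note that Java's UTF-8 decoder does not strip the BOM, so the first line read this way still starts with U+FEFF unless it is removed.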