Question

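I am trying to read some French characters from a file, but letters such as à, é and è come out as odd symbols. Can anyone show me how to get the file's actual characters? Here is my main method: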

public static void main(String args[]) throws IOException
{
    char current, org;

    //String strPath = "C:/Documents and Settings/tidh/Desktop/BB/hhItem01_2.txt";
    String strPath = "C:/Documents and Settings/tidh/Desktop/hhItem01_1.txt";
    InputStream fis;

    fis = new BufferedInputStream(new FileInputStream(strPath));

    while (fis.available() > 0) {
        current = (char) fis.read();  // read one byte and cast it to a character
        int ascii = (int) current;    // numeric code of that character
        org = (char) (ascii);
        System.out.println(org);
    }
}

Answer 1

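You are reading UTF-8 characters as if they were ASCII: the loop turns every byte into one char, so a multi-byte character such as é falls apart. Here is an example of how to do what you want: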

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class Test {
    private static final String FILE_PATH = "c:\\temp\\test.txt";

    public static void main(String[] args) {
        File fileDir = new File(FILE_PATH);

        // Decode the bytes as UTF-8; try-with-resources closes the reader.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), "UTF-8"))) {
            String str;
            while ((str = in.readLine()) != null) {
                System.out.println(str);
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException, a subclass of IOException.
            System.out.println(e.getMessage());
        }
    }
}

Reference: How to read UTF-8 encoded data from a file
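The difference between casting bytes and decoding them can be seen without touching a file. The following is only a minimal sketch: the class name `ByteCastDemo`, its two helper methods and the sample word are made up for illustration, and the input is assumed to be UTF-8 as this answer supposes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and sample text are illustrative only.
class ByteCastDemo {

    // Mimics the loop in the question: every byte becomes one char, no decoding.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) (b & 0xFF)); // same value InputStream.read() returns
        }
        return sb.toString();
    }

    // Decodes the whole sequence as UTF-8, reassembling multi-byte characters.
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e9t\u00e9" is the French word for summer; each \u00e9 takes two bytes in UTF-8.
        byte[] data = "\u00e9t\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(data).length()); // 5 chars: one per byte
        System.out.println(decodeUtf8(data).length());   // 3 chars: the real text
    }
}
```

Printing lengths rather than the text itself keeps the result independent of what the console can display.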

Answer 2

You can download the Apache Commons IO jar and try reading the file line by line instead of character by character.


Answer 3

The following assumes the text is in Windows Latin-1 (Windows-1252), but I have also added UTF-8 as a commented-out alternative.

private static final String FILE_PATH = "c:\\temp\\test.txt";

Path path = Paths.get(FILE_PATH);
//Charset charset = StandardCharsets.ISO_8859_1;
//Charset charset = StandardCharsets.UTF_8;
Charset charset = Charset.forName("Windows-1252");
try (BufferedReader in = Files.newBufferedReader(path, charset)) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
}

The string line will hold the text as Unicode. Whether it prints correctly then depends on whether System.out can convert that Unicode into the system encoding.

System.out.println("My encoding is: " + System.getProperty("file.encoding"));

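However, if you picked the right encoding, you will see at most one ? per special character. If you see more than one per special character, the file is probably in UTF-8, a multi-byte encoding, so use that.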

Choose a console font that supports Unicode.
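The "one ? versus several" rule of thumb can be reproduced in memory. This is a rough sketch under stated assumptions: US-ASCII stands in for a console encoding that cannot represent é, ISO-8859-1 stands in for a wrongly chosen single-byte charset, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the answer's code.
class ConsoleEncodingDemo {

    // Encodes the text for the given console charset and decodes it back,
    // approximating what would be visible. String.getBytes(Charset) replaces
    // characters the charset cannot represent, which is '?' for US-ASCII.
    static String asShownOn(String text, Charset consoleCharset) {
        return new String(text.getBytes(consoleCharset), consoleCharset);
    }

    public static void main(String[] args) {
        String correct = "\u00e9"; // é, decoded with the right charset
        // The same é stored as UTF-8 but read with a single-byte charset (assumed mistake).
        String misread = new String(correct.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);

        System.out.println(asShownOn(correct, StandardCharsets.US_ASCII)); // ?
        System.out.println(asShownOn(misread, StandardCharsets.US_ASCII)); // ??
    }
}
```

One substitute per accented letter points at the console; two or more per letter suggest the file was UTF-8 and was decoded with a single-byte charset.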

To check that an é was actually read:

String e = "\u00e9";
String s = new String(Files.readAllBytes(path), charset);
System.out.println("Contains é: " + s.contains(e));

After the comments:

It is better to use Files.newBufferedReader (I have corrected the code above), which does the equivalent of the following.

try (BufferedReader in = new BufferedReader(
         new InputStreamReader(
             new FileInputStream(file), charset))) {
    // read lines as above
}

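The BufferedReader buffers the input for faster reading, and the InputStreamReader takes the binary data of an InputStream plus a charset and converts it into the (Unicode) text of a Reader.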
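The same chain can be tried on any InputStream, not only on a file. Below is a small sketch, assuming a hypothetical helper `readLines` and an in-memory ISO-8859-1 sample in place of the real file:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names and sample data are illustrative only.
class ReaderChainDemo {

    // bytes -> InputStreamReader (decodes with charset) -> BufferedReader (buffers, splits lines)
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two French words encoded as single-byte ISO-8859-1 (assumed sample input).
        byte[] data = "caf\u00e9\nd\u00e9j\u00e0\n".getBytes(StandardCharsets.ISO_8859_1);
        List<String> lines = readLines(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(lines.size());                        // 2
        System.out.println(lines.get(0).equals("caf\u00e9"));    // true
    }
}
```

The charset passed to the InputStreamReader has to match the one the bytes were written with; the buffering layer does not affect decoding.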

Answer 4

The specific encoding IBM gives for French is CP1252 (preferred, because it works on all operating systems).

Regards,

A Frenchman
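One concrete reason CP1252 may suit French text better than ISO-8859-1 is the ligature œ, which CP1252 stores as byte 0x9C while ISO-8859-1 maps that byte to a control character. The sketch below illustrates this; the class and helper names are hypothetical, and it assumes the JDK in use provides the windows-1252 charset, which is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not taken from any of the answers.
class Cp1252Demo {

    // Decodes the bytes with the named charset and returns the first
    // character's code point in hexadecimal.
    static String firstCodePointHex(byte[] data, String charsetName) {
        String text = new String(data, Charset.forName(charsetName));
        return Integer.toHexString(text.codePointAt(0));
    }

    public static void main(String[] args) {
        byte[] oe = {(byte) 0x9C};
        System.out.println(firstCodePointHex(oe, "windows-1252")); // 153 (U+0153, the oe ligature)
        System.out.println(firstCodePointHex(oe, "ISO-8859-1"));   // 9c  (U+009C, a control character)
    }
}
```

For letters such as à, é and è the two charsets agree, so the choice only matters for bytes in the 0x80 to 0x9F range.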

How to read French characters using BufferedInputStream

4 answers: