Strange characters when reading an srt file as text

Date: 2015-11-30 16:32:06

Tags: java unicode

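I am trying to read a file into a string. I set the encoding to UTF-8, but it still fails: the output contains some strange characters.

This is my function for reading the file: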

private static String readFile(String path, boolean isRaw) throws UnsupportedEncodingException, FileNotFoundException {
    File fileDir = new File(path);
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), "UTF-8"));

        String str;

        while ((str = in.readLine()) != null) {
            System.out.println(str);
        }

        in.close();
        return str;
    } catch (UnsupportedEncodingException e) {
        System.out.println(e.getMessage());
    } catch (IOException e) {
        System.out.println(e.getMessage());
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
    return null;
}

The output of the first line is: 1

This is my test file: https://www.dropbox.com/s/2linqmdoni77e5b/How.to.Get.Away.with.Murder.S01E01.720p.HDTV.X264-DIMENSION.srt?dl=0

Thanks in advance.

2 answers:

Answer 0: (score: 3)

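This file is encoded in UTF-16LE and starts with a byte order mark (BOM), which helps determine the encoding. Use the "UTF-16LE" charset (or StandardCharsets.UTF_16LE) and skip the first character of the file (for example, by calling str.substring(1) on the first line).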
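A minimal sketch of that approach using only the JDK might look like the following. The class and method names (Utf16LeReader, readUtf16Le) are made up for illustration, and the code assumes the file really is UTF-16LE. Java's UTF-16LE decoder does not strip the BOM, so it shows up as the character U+FEFF at the start of the first line and is dropped there.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original question or answer.
class Utf16LeReader {

    // Reads a UTF-16LE file into a String, removing a leading BOM (U+FEFF) if present.
    // Line terminators are normalized to '\n'.
    static String readUtf16Le(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_16LE)) {
            String line;
            boolean firstLine = true;
            while ((line = in.readLine()) != null) {
                // The BOM, if any, is decoded as the first character of the first line.
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                firstLine = false;
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }
}
```

Calling Utf16LeReader.readUtf16Le(path) on the subtitle file should then return the text without the leading BOM character. Checking startsWith before calling substring(1) keeps the method safe for UTF-16LE files that have no BOM.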

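Answer 1: (score: 1)

Your file looks like it starts with a BOM. If you don't need to handle the BOM character, open Notepad++ and convert the file to UTF-8 without BOM.

To handle files with a BOM in Java, take a look at the Apache site for BOMInputStream.

Example: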

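// Imports needed by this method (BOMInputStream comes from Apache Commons IO):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

private static String readFile(String path, boolean isRaw) throws UnsupportedEncodingException, FileNotFoundException {
    File fileDir = new File(path);

    // BOMInputStream detects the BOM and strips it from the stream.
    // You can also detect UTF-8, UTF-16BE, UTF-32LE and UTF-32BE by using the constructor below:
    // new BOMInputStream(new FileInputStream(fileDir), ByteOrderMark.UTF_16LE,
    //         ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_32LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_8);
    try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream(fileDir), ByteOrderMark.UTF_16LE)) {
        String charset = "UTF-8"; // fallback when no BOM is found
        if (bomIn.hasBOM()) {
            System.out.println("Input file starts with a BOM; the BOM character has been removed");
            // Decode with the charset that matches the detected BOM (UTF-16LE here), not UTF-8
            charset = bomIn.getBOMCharsetName();
        }

        BufferedReader in = new BufferedReader(new InputStreamReader(bomIn, charset));

        // Collect the lines so the method returns the file content instead of null
        StringBuilder content = new StringBuilder();
        String str;
        while ((str = in.readLine()) != null) {
            System.out.println(str);
            content.append(str).append('\n');
        }
        return content.toString();
    } catch (IOException e) {
        // Also covers FileNotFoundException and UnsupportedEncodingException
        System.out.println(e.getMessage());
    }
    return null;
}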