Converting from windows-1256 to UTF-8 causes punctuation issue

Time: 2017-06-12 17:01:48

Tags: java character-encoding right-to-left google-cast bidi

I have an Arabic subtitle I'm trying to convert from SRT to VTT. The subtitle seems to use windows-1256, according to the character encoding detector in ICU (Java). The final VTT file is in UTF-8.

The subtitle converts fine and it all looks right, except that the punctuation moves from the left side to the right side. I am using this subtitle on the Chromecast, so at first I thought it was a Chromecast issue, but even gedit on Linux shows the problem. LibreOffice does not, and neither does the console output in IntelliJ.

I wrote a simple piece of code to recreate the issue without actually converting from SRT to VTT, just by converting from windows-1256 to UTF-8.

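import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Reencode {
    public static void main(String[] args) throws IOException {
        Charset source = Charset.forName("windows-1256");

        // Decode the original file as windows-1256 and write it back out as UTF-8.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream("arabic sub.srt"), source));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
                writer.write(line);
                writer.write("\r\n");
            }
        }

        // Read the UTF-8 copy back and print it again for comparison.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}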

Here is the output from the IntelliJ console:

[Screenshot: IntelliJ console]

As you can see, the dot is on the left side, which I guess is correct.

Here is what gedit shows:

[Screenshot: gedit]

Most of the text is on the right, which I guess is correct, but the period is also on the right, which I guess is wrong.

Here is LibreOffice:

[Screenshot: LibreOffice]

This is mostly correct: the punctuation is on the left. However, the text is also on the left, and I guess it should be on the right.

This is the subtitle I'm testing: https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar

I also tried a different SRT that was originally encoded as UTF-8 and that one worked fine without issues. So my guess is that the conversion from windows-1256 is the issue.

So what is the issue with the way I'm re-encoding the file?

Thanks.

Edit: I forgot a Chromecast picture.

[Screenshot: Chromecast]

As you can see, the punctuation is on the wrong side.

EDIT: I just noticed that chardet on Linux says the file is MacCyrillic, not windows-1256, while the Java ICU library says windows-1256. Anyway, if I use MacCyrillic, the punctuation looks fine in gedit, but the text itself turns into garbage characters.

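2 answers:

Answer 0 (score: 2)

Looking at the original subtitle file, I can tell that it is malformed. The full stops appear before the text even when the file is displayed in a left-to-right character set. I believe windows-1256 is the correct character set.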

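The only way this will display correctly is if the punctuation at the start of the line is rendered LTR while the rest of the line is rendered RTL. You can try to force this by adding a Unicode left-to-right mark (U+200E) after the punctuation.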
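A minimal sketch of that idea in Java, assuming "punctuation at the start of the line" means any run of Unicode punctuation characters (`\p{P}`) at position 0. The class and method names are hypothetical, and whether a given player or editor honors the mark depends on the base direction it applies to the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingPunctuationMarker {
    // U+200E LEFT-TO-RIGHT MARK: invisible, but counts as a strong LTR character.
    static final char LRM = '\u200E';

    // Assumption: a run of Unicode punctuation at the very start of the line.
    private static final Pattern LEADING_PUNCT = Pattern.compile("^\\p{P}+");

    // Inserts an LRM right after the leading punctuation run, if there is one.
    static String markLeadingPunctuation(String line) {
        Matcher m = LEADING_PUNCT.matcher(line);
        if (!m.find()) {
            return line; // no leading punctuation, leave the line untouched
        }
        return line.substring(0, m.end()) + LRM + line.substring(m.end());
    }

    public static void main(String[] args) {
        // A period stored before five Arabic letters, as in the malformed file.
        String line = ".\u0645\u0631\u062D\u0628\u0627";
        String fixed = markLeadingPunctuation(line);
        // Print code points, since the mark itself is invisible.
        fixed.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

In the asker's loop this would be applied to each `line` before `writer.write(line)`. Note that `\p{P}` also matches a leading dialogue dash, so a real converter may want a narrower pattern.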

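If you would rather fix the original file, you need to move any punctuation from the beginning of each line to the end. Brackets at the beginning of a line also need to be reversed.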
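A rough sketch of that repair, under the same assumption that leading punctuation is a run of `\p{P}` characters. It reads "reversed" as swapping each bracket for its counterpart, and it also reverses the order of the run so the character nearest the text stays nearest the text; both are my interpretation, and the names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingPunctuationMover {
    // Assumption: a run of Unicode punctuation at the very start of the line.
    private static final Pattern LEADING_PUNCT = Pattern.compile("^\\p{P}+");

    // Swaps a bracket for its counterpart; other characters pass through.
    static char mirror(char c) {
        switch (c) {
            case '(': return ')';
            case ')': return '(';
            case '[': return ']';
            case ']': return '[';
            default:  return c;
        }
    }

    // Moves the leading punctuation run to the end of the line.
    static String moveToEnd(String line) {
        Matcher m = LEADING_PUNCT.matcher(line);
        if (!m.find()) {
            return line; // nothing to move
        }
        StringBuilder tail = new StringBuilder();
        // Walk the run backwards so the character that touched the text
        // is still the one adjacent to it after the move.
        for (int i = m.end() - 1; i >= 0; i--) {
            tail.append(mirror(line.charAt(i)));
        }
        return line.substring(m.end()) + tail;
    }

    public static void main(String[] args) {
        String fixed = moveToEnd(".\u0645\u0631\u062D\u0628\u0627");
        fixed.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

Subtitle-specific prefixes such as a leading dialogue dash or `{...}` positioning tags also match `\p{P}`, so it is worth checking a few lines by hand before running this over a whole file.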

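Answer 1 (score: 0)

Since encoding has nothing to do with text direction (LTR vs. RTL), I think you should make use of the Unicode marks that were created specifically for this purpose.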

  • Left-to-right mark: &lrm; or &#x200E; (U+200E)
  • Right-to-left mark: &rlm; or &#x200F; (U+200F)

In short: a text file carries no information about text direction; it is just a text file.

Cf. https://www.w3.org/TR/WCAG-TECHS/H34.html
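To illustrate the point that re-encoding does not touch character order, here is a small sketch, assuming the JDK in use ships the windows-1256 charset. It decodes windows-1256 bytes, re-encodes them as UTF-8, and compares the results; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRoundTrip {
    // Decodes bytes in one charset, re-encodes them in another,
    // and returns the text as read back from the new encoding.
    static String reencode(byte[] input, Charset from, Charset to) {
        String decoded = new String(input, from);
        byte[] converted = decoded.getBytes(to);
        return new String(converted, to);
    }

    public static void main(String[] args) {
        Charset cp1256 = Charset.forName("windows-1256");
        // A period stored before five Arabic letters (logical order).
        String original = ".\u0645\u0631\u062D\u0628\u0627";
        byte[] legacy = original.getBytes(cp1256);

        String viaUtf8 = reencode(legacy, cp1256, StandardCharsets.UTF_8);

        // Same code points in the same order: the period is still first.
        System.out.println(original.equals(viaUtf8));
        viaUtf8.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

If the period comes first in the stored sequence, it comes first in both encodings; where it ends up on screen is decided by the bidi handling of whatever renders the text.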