I have an Arabic subtitle file that I'm trying to convert from SRT to VTT. The subtitle file seems to be encoded as windows-1256, according to the character encoding detector in ICU (Java). The final VTT file is in UTF-8.
The subtitle converts fine and it all looks right, except that the punctuation moves from the left side to the right side. I am using this subtitle on the Chromecast, so at first I thought it was an issue with the Chromecast, but even gedit on Linux has the issue. However, LibreOffice does not have the issue, nor does the console output in IntelliJ.
I wrote a simple piece of code to recreate the issue without actually converting from SRT to VTT, just by converting from windows-1256 to UTF-8.
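    import java.io.*;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Re-encode the windows-1256 subtitle as UTF-8, echoing each line to the console.
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(new FileInputStream("arabic sub.srt"), Charset.forName("windows-1256")));
         BufferedWriter writer = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
            writer.write(line);
            writer.write("\r\n");
        }
    }

    // Read the UTF-8 copy back and echo it again for comparison.
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(new FileInputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }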
Here is the output from the IntelliJ console:
As you can see the dot is on the left side which I guess is correct.
Here is what gedit shows:
Most of the text is to the right which I guess is correct but the period is on the right, which I guess is wrong.
Here is LibreOffice:
This is mostly correct: the punctuation is on the left. However, the text is also on the left, and I guess it should be on the right.
This is the subtitle I'm testing https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar
I also tried a different SRT that was originally encoded as UTF-8 and that one worked fine without issues. So my guess is that the conversion from windows-1256 is the issue.
So what is the issue with the way I'm re-encoding the file?
Thanks.
Edit: Forgot a chromecast picture.
As you can see the punctuation is on the wrong side.
EDIT: I just noticed that Linux chardet says it is MacCyrillic, not windows-1256, but the Java ICU library says windows-1256. Anyway, if I use MacCyrillic then the punctuation looks fine in gedit, but the text itself doesn't look right; it turns into garbage characters.
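Answer 0 (score: 2)
Looking at the original subtitle file, I can confirm that it is malformed. Even when displayed in a left-to-right character set, the full stops appear before the text. I believe the correct character set is windows-1256.
The only way this will display correctly is if the punctuation at the start of the line is rendered LTR and the rest of the line is rendered RTL. You can try to force this by adding the Unicode left-to-right mark (U+200E) after the punctuation.
If you would rather fix the original file, you will need to move any punctuation from the start of each line to the end. Brackets at the start of a line also need to be reversed.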
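A minimal sketch of that suggestion, assuming "punctuation" means any run of Unicode punctuation characters at the very start of a line. The class and method names are hypothetical, and whether a given player honours the mark is untested here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: inserts a left-to-right mark after leading punctuation.
final class LeadingPunctuationLrm {
    // U+200E LEFT-TO-RIGHT MARK: invisible, but counts as a strong LTR character.
    static final String LRM = "\u200E";

    // Assumption: "punctuation" is any run of Unicode category P at line start.
    private static final Pattern LEADING_PUNCT = Pattern.compile("^\\p{P}+");

    static String markLeadingPunctuation(String line) {
        Matcher m = LEADING_PUNCT.matcher(line);
        // Leave lines alone if they have no leading punctuation
        // or consist of punctuation only.
        if (!m.lookingAt() || m.end() == line.length()) {
            return line;
        }
        return line.substring(0, m.end()) + LRM + line.substring(m.end());
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        String fixed = markLeadingPunctuation("." + arabic);
        // Print code points so the invisible mark is visible in the output.
        fixed.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

Each subtitle text line would be passed through `markLeadingPunctuation` before being written to the UTF-8 file. Lines that start with a dialogue hyphen are also matched by this pattern, so a real converter may need a narrower rule.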
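The file-fixing approach could be sketched roughly as follows. This is one reading of the advice, not a tested converter: it assumes the leading punctuation run should be appended in reverse order, and that "reversed" means a bracket is swapped for its mirror image as it moves. All names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: moves leading punctuation to the end of the line.
final class LeadingPunctuationMover {
    // Assumption: "punctuation" is any run of Unicode category P at line start.
    private static final Pattern LEADING_PUNCT = Pattern.compile("^\\p{P}+");

    // Swap a bracket for its mirror image; other characters pass through.
    static char mirror(char c) {
        switch (c) {
            case '(': return ')';
            case ')': return '(';
            case '[': return ']';
            case ']': return '[';
            case '{': return '}';
            case '}': return '{';
            default:  return c;
        }
    }

    static String moveLeadingPunctuationToEnd(String line) {
        Matcher m = LEADING_PUNCT.matcher(line);
        // Nothing to move, or the line is punctuation only.
        if (!m.lookingAt() || m.end() == line.length()) {
            return line;
        }
        String punct = line.substring(0, m.end());
        StringBuilder fixed = new StringBuilder(line.substring(m.end()));
        // Append in reverse so the mark closest to the text stays closest to it.
        for (int i = punct.length() - 1; i >= 0; i--) {
            fixed.append(mirror(punct.charAt(i)));
        }
        return fixed.toString();
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        String fixed = moveLeadingPunctuationToEnd("." + arabic);
        // The last code point should now be the period, U+002E.
        System.out.printf("U+%04X%n", (int) fixed.charAt(fixed.length() - 1));
    }
}
```

Timestamps and cue numbers start with digits, so they pass through unchanged. Lines that open with a dialogue hyphen would need separate handling.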
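Answer 1 (score: 0)
Since the encoding has nothing to do with text direction (LTR vs. RTL), I think you should make use of the Unicode directional marks that were created specifically for this purpose.
In short: a text file carries no information about text direction; it is just a text file.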
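To illustrate the point, here is a small sketch (hypothetical names) that round-trips a line through windows-1256 and UTF-8 and then asks `java.text.Bidi` which base direction it would derive from the characters alone. It assumes the JDK in use ships the windows-1256 charset, and that a viewer picks the paragraph direction from the first strong character, which not every program does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Bidi;

// Hypothetical demo: re-encoding keeps the characters; direction comes from content.
final class EncodingVsDirection {
    private static final Charset CP1256 = Charset.forName("windows-1256");

    // Encode as windows-1256, decode, re-encode as UTF-8, and decode again.
    static String reencode(String text) {
        String decoded = new String(text.getBytes(CP1256), CP1256);
        return new String(decoded.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Base direction derived from the first strong character, defaulting to LTR.
    static boolean baseIsRightToLeft(String line) {
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        String line = "." + arabic;
        System.out.println("same after re-encoding: " + reencode(line).equals(line));
        System.out.println("base RTL without mark:  " + baseIsRightToLeft(line));
        System.out.println("base RTL with LRM:      " + baseIsRightToLeft(".\u200E" + arabic));
    }
}
```

The re-encoded string is identical to the input, so the conversion itself does not move the period. What changes between programs is the base direction each one applies when laying out the same characters.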