I have an SRT file. Here is an excerpt:
2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!
3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí;
Óïõ ðÞñá äþñï ãåíåèëßùí.
4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí
íá ìïõ ðÜñåéò êÜôé.
5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ.
Êáé èá ôï öáò.
6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò
èá Ýôñùãá êïñìü.
7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.
8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...
9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò
íá ìïõ êÜíåéò...
10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.
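These are supposedly Japanese subtitles, but the text is obviously mojibake caused by an encoding problem. I am trying to figure out how to correct it and ultimately convert it to UTF-8. Does anyone have any ideas?
Output of file: UTF-8 Unicode (with BOM) text, with CRLF line terminators
The file can be downloaded for testing here: http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja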
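Answer 0 (score: 3)
What you have is a document that was transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, although the source document was actually encoded in ISO-8859-7. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) and a few quotation marks (U+201C, U+201D) were added.
The language is Greek, and the corrected second subtitle entry is: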
2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!
The English translation is "I'll kill you, Gouaintzelstin!".
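To reverse/correct it: decode the octets as UTF-8, remove all code points greater than U+00FF (the BOM and the quotation marks), encode the string as ISO-8859-1, decode the resulting octets as ISO-8859-7, and finally encode the string as UTF-8.
An implementation of the above in Perl: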
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw[];
(@ARGV == 1 && -f $ARGV[0])
  or die qq[Usage: $0 <file>];
my $file = shift @ARGV;
my ($octets, $string);
# Read all the octets from the file
$octets = do {
    open my $fh, '<:raw', $file
      or die qq[Could not open '$file' for reading: '$!'];
    local $/; <$fh>
};
# Decode the octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);
# Remove all code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;
# Encode the string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);
# Decode the octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);
# Encode the string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);
# Output the octets on standard output
print $octets;
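
As a quick sanity check, the same round trip can be tried on a single line before running the script on the whole file. The following is only a minimal sketch: the helper names garble and fix_mojibake are hypothetical and not part of the script above, and the sample strings are the second subtitle line from the question and its corrected Greek form.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;            # the string literals below are UTF-8 source text
use Encode qw[];

# Hypothetical helper: reproduce the damage by reading ISO-8859-7
# octets as if they were ISO-8859-1.
sub garble {
    my ($string) = @_;
    my $octets = Encode::encode('ISO-8859-7', $string);
    return Encode::decode('ISO-8859-1', $octets);
}

# Hypothetical helper: undo the damage on an already decoded string.
sub fix_mojibake {
    my ($string) = @_;
    # Drop code points that ISO-8859-1 cannot represent (BOM, curly quotes)
    $string =~ s/[^\x00-\xFF]//g;
    # Recover the original octets, then read them as ISO-8859-7
    my $octets = Encode::encode('ISO-8859-1', $string);
    return Encode::decode('ISO-8859-7', $octets);
}

my $greek   = "Θα σε σκοτώσω, Γουάιντζελστιν!";
my $garbled = "Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!";

binmode STDOUT, ':encoding(UTF-8)';
print garble($greek) eq $garbled ? "garble: match\n" : "garble: differ\n";
print fix_mojibake($garbled), "\n";
```

If the diagnosis is right, the first line reports a match and the second prints the Greek text. If a file does not round-trip this way, the original character set was probably something other than ISO-8859-7.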