Question

I am trying to read an RTF file and extract the characters in it. For example, below is the RTF version of ф:

{\rtf1\ansi\ansicpg1252\fromtext\fbidis\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern Courier New;}{\f2\fnil\fcharset2 Symbol;}{\f3\fmodern\fcharset0 Courier New;}{\f4\fswiss\fcharset204 Arial;}}{\colortbl\red0\green0\blue0;\red0\green0\blue255;}\uc1\pard\plain\deftab360\f0\fs20\htmlrtf{\f4\fs20\htmlrtf0 \'f4\htmlrtf\f0}\htmlrtf0\par}

As you can see, the encoding here is Windows-1252.

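#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(decode encode);

binmode(STDOUT, ":utf8");

# Reference: the character I expect (U+0444, ф)
my $runtime = chr(0x0444);
print "theta || ".$runtime." ||";

# Byte value taken from \'f4 in the RTF
my $hexstr = "0xF4";
my $num = hex $hexstr;
my $be_num = pack("N", $num);   # 4 big-endian bytes: 00 00 00 F4

# Decode as cp1252 (the document-wide \ansicpg)
$runtime = decode("cp1252", $be_num);
print "\n".$runtime."\n";

# Decode as cp1251
$runtime = decode("cp1251", $be_num);
print "\n".$runtime."\n";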

Output

theta || ф ||
ô

ф

As you can see, with cp1252 I get ô. Am I missing something? I want to take the encoding from the RTF itself. I intend to print ф, but it prints ô.

Answer 1

Although the document's global code page is cp1252, there are local definitions:

The \xf4 character is written using font f4: {\f4...\'f4
But font f4 is defined as: {\f4\fswiss\fcharset204 Arial;}
\fcharset204 sets this font's character set to 204, i.e. Russian, code page 1251 (according to http://msdn.microsoft.com/en-us/library/cc194829.aspx)

With code page 1251, you get the expected character ф.
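The lookup described above can be sketched in Perl roughly as follows. This is only a minimal illustration, not a real RTF parser: the helper name decode_rtf_hex_escapes and the %CHARSET_TO_CODEPAGE table are made up for this example, the table covers just a few \fcharset values, the font table is assumed to nest only one level deep, and group scoping ({ ... }) is ignored, so the most recent \fN simply wins.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Partial \fcharsetN -> code page table (assumption: a few common values only).
my %CHARSET_TO_CODEPAGE = (
    0   => 'cp1252',   # ANSI / Western
    161 => 'cp1253',   # Greek
    162 => 'cp1254',   # Turkish
    186 => 'cp1257',   # Baltic
    204 => 'cp1251',   # Russian (Cyrillic)
    238 => 'cp1250',   # Eastern European
);

# Hypothetical helper: decode each \'hh escape with the code page of the
# font that was selected most recently.
sub decode_rtf_hex_escapes {
    my ($rtf) = @_;

    # Document-wide default from \ansicpgN, falling back to cp1252.
    my $default = $rtf =~ /\\ansicpg(\d+)/ ? "cp$1" : 'cp1252';

    # Per-font code pages from entries like {\f4\fswiss\fcharset204 Arial;}
    my %font_cp;
    while ($rtf =~ /\{\\f(\d+)[^{}]*?\\fcharset(\d+)/g) {
        $font_cp{$1} = $CHARSET_TO_CODEPAGE{$2}
            if exists $CHARSET_TO_CODEPAGE{$2};
    }

    # Drop the font table so its \fN entries are not mistaken for font switches.
    (my $body = $rtf) =~ s/\{\\fonttbl(?:[^{}]|\{[^{}]*\})*\}//;

    my $current = $default;
    my $out     = '';
    while ($body =~ /\\f(\d+)|\\'([0-9a-fA-F]{2})/g) {
        if (defined $1) {
            # Font switch: use that font's code page, or the document default.
            $current = $font_cp{$1} // $default;
        } else {
            # \'hh escape: one byte, decoded with the current code page.
            $out .= decode($current, pack('C', hex $2));
        }
    }
    return $out;
}

# Shortened version of the RTF from the question.
my $rtf = <<'RTF';
{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f4\fswiss\fcharset204 Arial;}}\pard\plain\f0\fs20{\f4\fs20 \'f4\f0}\par}
RTF

binmode(STDOUT, ':encoding(UTF-8)');
print decode_rtf_hex_escapes($rtf), "\n";   # prints ф for this sample
```

Here \'f4 is decoded as cp1251 because \f4 is the active font at that point; the same byte under \f0 would be decoded as cp1252 and give ô.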

BTW, code page 1252 is similar to Latin-1 and does not contain the character ф at all (see http://en.wikipedia.org/wiki/Windows-1252)
