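PHP and antiword fail to parse Cyrillic text correctly

Time: 2011-12-02 08:31:27

Tags: php character-encoding internationalization ms-office

I am trying to parse MS Office 2003 documents with antiword on a Linux server, but it does not parse Cyrillic text correctly.

It returns something like this: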

??? ???? ???????????

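Does anyone know how to correctly parse MS Office 2003 documents that contain Cyrillic text?

2 answers:

Answer 0: (score: 1)

I solved this problem for Cyrillic text.

You can find good documentation here.

The working code is as follows: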

// -m selects the character mapping file; escapeshellarg() guards the file name
$content = shell_exec('/usr/bin/antiword -m cp1251.txt ' . escapeshellarg($filename));
var_dump($content);

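Note the -m parameter (the character mapping file).

You forgot to set the correct mapping file.

Here is the part of the documentation that covers mapping files: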

Q9: Which mapping file (-m option) is correct in my situation?
A9: The correct mapping file depends on the character set you need for output
    in a specific language.
    For Western European languages (like English, French, German) this is
    8859-1.txt. (OS/2: cp1252.txt) (DOS: cp850.txt)
    For Eastern European languages (like Polish, Czech, Slovak, Croatian) this
    is 8859-2.txt. (OS/2: cp1250.txt) (DOS: cp852.txt)
    For Esperanto use 8859-3.txt.
    For Russian use 8859-5.txt or koi8-r.txt. (OS/2: cp1251.txt)
     (DOS: cp866.txt)
    For Ukrainian use koi8-u.txt.
    For Arabic use 8859-6.txt. (DOS: cp864.txt)
    For Hebrew use 8859-8.txt. (DOS: cp862.txt)
    For Thai use 8859-11.txt.
    If your system supports it, you might also try UTF-8.txt.

    NOTE: UTF-8 also enables Antiword to show text in languages like Chinese,
          Japanese and Korean.
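
As a rough sketch of the advice above, the helper below builds the antiword command with an explicit mapping file and escapes both arguments. The function names buildAntiwordCommand and extractDocText are made up for illustration. It assumes antiword lives at /usr/bin/antiword (as in the snippet above) and that UTF-8.txt is among the installed mapping files, as the FAQ excerpt suggests.

```php
<?php
// Hypothetical helper: builds an antiword command line with an explicit mapping file.
// Assumption: the binary path matches the one used in the answer above.
function buildAntiwordCommand(string $docPath, string $mapping = 'UTF-8.txt'): string
{
    return '/usr/bin/antiword -m ' . escapeshellarg($mapping) . ' ' . escapeshellarg($docPath);
}

// Hypothetical wrapper: runs the command and returns the extracted text,
// or null if the command produced no output or could not be run.
function extractDocText(string $docPath, string $mapping = 'UTF-8.txt'): ?string
{
    $output = shell_exec(buildAntiwordCommand($docPath, $mapping));
    return is_string($output) ? $output : null;
}
```

Calling extractDocText($filename, 'cp1251.txt') would reproduce the command from the answer, while the default asks for UTF-8 output instead.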

Answer 1: (score: 0)

Antiword has an encoding parameter; maybe give it a try:

 shell_exec('antiword -X UTF-8 test.doc')

Or use koi8-r and then convert the result in PHP with iconv().
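
A minimal sketch of that conversion step, assuming antiword was run with the koi8-r.txt mapping file so its output is KOI8-R encoded. The function name koi8rToUtf8 is hypothetical; iconv() itself is the standard PHP function.

```php
<?php
// Hypothetical helper: converts KOI8-R encoded text (for example antiword output
// produced with "-m koi8-r.txt") to UTF-8.
function koi8rToUtf8(string $text): string
{
    $converted = iconv('KOI8-R', 'UTF-8', $text);
    if ($converted === false) {
        // iconv() returns false when the conversion fails
        throw new RuntimeException('Could not convert text from KOI8-R to UTF-8');
    }
    return $converted;
}
```

The same pattern would apply to the other single-byte mappings listed in the FAQ, with a different source charset name passed to iconv().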

Or try LibreOffice in command-line mode:

 shell_exec('soffice --headless --convert-to txt test.doc')
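
Keep in mind that soffice writes the converted text to a file rather than to standard output, so the return value of shell_exec() will not contain the document text. The sketch below is one way to handle that, assuming soffice is on the PATH. Both function names are hypothetical, and the encoding of the resulting file may depend on the LibreOffice setup.

```php
<?php
// Hypothetical helper: command for a headless LibreOffice conversion to plain text.
// --outdir tells soffice where to write the converted file.
function buildSofficeCommand(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to txt --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: path of the file the conversion is expected to produce.
// LibreOffice keeps the base name and swaps the extension.
function convertedTextPath(string $docPath, string $outDir): string
{
    return rtrim($outDir, '/') . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
}
```

After running shell_exec(buildSofficeCommand('test.doc', '/tmp/out')), the text could then be read with file_get_contents(convertedTextPath('test.doc', '/tmp/out')).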