Is there a way to read Doc files in PHP, the same way as Docx files?

Asked: 2013-11-12 06:22:15

Tags: php parsing document

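I am able to extract the text content of Docx files, and I want to do the same for Doc files. I tried the same code, but it could not read anything. I suspect the reason is that "the Doc format is not a zipped archive." Here is the code: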

    function readDocx($filePath)
    {
        // The main body of a DOCX package lives in this XML part
        $dataFile = 'word/document.xml';

        // Open the received file as a ZIP archive
        $zip = new ZipArchive;
        if (true !== $zip->open($filePath)) {
            return "";
        }

        // Read the data file into a string (false if it is not in the archive)
        $data = $zip->getFromName($dataFile);
        $zip->close();
        if ($data === false) {
            return "";
        }

        // Load XML from the string, skipping errors and warnings.
        // loadXML() is an instance method; calling it statically fails on PHP 8.
        // LIBXML_NOENT is dropped so external entities are not expanded.
        $xml = new DOMDocument();
        if (!$xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING)) {
            return "";
        }

        // Strip the tags and remove newlines. Note that "\n" must be
        // double-quoted; '\n' is a literal backslash followed by n.
        return str_replace("\n", '', strip_tags($xml->saveXML()));
    }

If there is a way to extract the text content from a Doc file, please let me know.
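
The guess about the container format can be checked by looking at the first bytes of the file. The following is a minimal sketch, not part of the original question: `detectWordContainer` is a hypothetical helper name, and it only inspects the file signature. DOCX files are ZIP packages and start with `PK\x03\x04`, while legacy DOC files normally use the OLE2 compound file signature `D0 CF 11 E0 A1 B1 1A E1`.

```php
<?php
// Hypothetical helper (an illustration, not from the original post):
// classify a Word file by its leading signature bytes.
function detectWordContainer(string $filePath): string
{
    if (!is_file($filePath) || !is_readable($filePath)) {
        return 'unreadable';
    }

    $handle = fopen($filePath, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    $header = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header; a DOCX file is a ZIP package.
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }

    // OLE2 compound file signature, used by legacy binary .doc files.
    if ($header === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }

    return 'unknown';
}
```

A caller could send 'zip' results to readDocx() and handle 'ole2' files with a different converter, since ZipArchive cannot open them.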

1 Answer:

Answer 0: (score: 4)

OK, I finally found an answer, so I thought I would share it here. I simply used COM objects:

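    $DocumentPath = "C:/xampp/htdocs/abcd.doc";

    // Requires Windows with Microsoft Word installed and the PHP COM extension enabled
    $word = new COM("word.application") or die("Unable to instantiate application object");
    $word->Visible = 0;

    // Documents->Open() returns the document object, so a separate
    // new COM("word.document") is not needed
    $wordDocument = $word->Documents->Open($DocumentPath);

    // Build the output path by swapping the 3-character extension for "html"
    $HTMLPath = substr_replace($DocumentPath, 'html', -3, 3);

    // 3 is wdFormatTextLineBreaks (plain text with line breaks);
    // use 8 (wdFormatHTML) if actual HTML output is wanted
    $wordDocument->SaveAs($HTMLPath, 3);

    // Release the document and shut down Word
    $wordDocument = null;
    $word->Quit();
    $word = null;

    // Output the converted file, then delete it
    readfile($HTMLPath);
    unlink($HTMLPath);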
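
The `substr_replace` call above assumes the input path ends in a three-character extension. As a small sketch of a more general way to derive the output path, the hypothetical helper below (`replaceExtension` is not part of the original answer) uses `pathinfo()` to find the extension whatever its length:

```php
<?php
// Hypothetical helper (an illustration, not from the original answer):
// replace the file extension of a path, whatever its length.
function replaceExtension(string $path, string $newExtension): string
{
    // pathinfo() only looks at the last path component,
    // so dots in directory names are ignored.
    $extension = pathinfo($path, PATHINFO_EXTENSION);

    // Cut off ".<extension>" if there is one; otherwise keep the path as is.
    $base = $extension === ''
        ? $path
        : substr($path, 0, -strlen($extension) - 1);

    return $base . '.' . $newExtension;
}
```

For example, `replaceExtension($DocumentPath, 'html')` could stand in for the `substr_replace` line, and it would also cope with inputs such as `.docx` or `.rtf`.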