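I can extract the text content of a Docx file, and I want to do the same for Doc files. I tried the same code, but it could not read anything. I think the reason is that "the Doc format is not a zipped archive." Here is the code: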
function readDocx($filePath)
{
    // Create a new ZIP archive handle
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open the received archive file
    if (true === $zip->open($filePath)) {
        // If it opened, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it into a string
            $data = $zip->getFromIndex($index);
            // Close the archive file
            $zip->close();
            // Load XML from the string, skipping errors and warnings
            // (loadXML must be called on an instance, not statically)
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Strip the tags, then split on real newlines ("\n" needs double quotes)
            $contents = explode("\n", strip_tags($xml->saveXML()));
            // Join the pieces back into a single string
            return implode('', $contents);
        }
        $zip->close();
    }
    return "";
}
If there is a way to extract the text content from a Doc file, please let me know.
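The guess about the container format can be checked by looking at the first bytes of the file. Here is a minimal sketch, assuming a hypothetical helper named `detectWordFormat` (not part of the code above). ZIP archives, which is what `.docx` files are, begin with `PK\x03\x04`. Legacy `.doc` files are OLE2 compound files and begin with the 8-byte signature `D0 CF 11 E0 A1 B1 1A E1`. The check only identifies the container: other ZIP-based or OLE2-based files (for example `.xlsx` or `.xls`) match as well.

```php
<?php
// Hypothetical helper (not from the original post): report the container
// format of a file based on its leading "magic" bytes.
function detectWordFormat(string $filePath): string
{
    $handle = @fopen($filePath, 'rb');
    if ($handle === false) {
        return 'unknown';
    }
    // Read at most the first 8 bytes; an empty file yields an empty string.
    $header = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header signature (the container used by .docx).
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // OLE2 compound file signature (the container used by legacy .doc).
    if ($header === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    return 'unknown';
}
```

A caller could pass the file to `readDocx` only when the result is `'zip'` and take a different route for `'ole2'`.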
Answer 0 (score: 4)
OK, I finally found the answer, so I thought I would share it here. I simply used COM objects:
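$DocumentPath = "C:/xampp/htdocs/abcd.doc";
// Start Word through COM (requires Windows with Microsoft Word installed)
$word = new COM("word.application") or die("Unable to instantiate application object");
$word->Visible = 0;
// Open the document inside the running Word instance
$wordDocument = $word->Documents->Open($DocumentPath);
// Build the output path by swapping the "doc" extension for "html"
$HTMLPath = substr_replace($DocumentPath, 'html', -3, 3);
// Save a converted copy; 3 is wdFormatTextLineBreaks in Word's WdSaveFormat
// enumeration, so the file holds plain text despite its .html name
$wordDocument->SaveAs($HTMLPath, 3);
// Release the document and shut Word down
$wordDocument = null;
$word->Quit();
$word = null;
// Output the converted file, then delete it
readfile($HTMLPath);
unlink($HTMLPath);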
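The same idea can be wrapped in a function that returns the text instead of printing it. This is only a sketch under several assumptions. The names `readDocViaCom` and `textOutputPath` are hypothetical. It saves with format `2` (`wdFormatText`, plain text) rather than the `3` used above. It needs Windows, Microsoft Word, and PHP's COM extension, so it throws when the `COM` class is missing.

```php
<?php
// Hypothetical helper: derive the temporary text file path from the .doc path.
function textOutputPath(string $docPath): string
{
    // Replace a trailing ".doc" (any case) with ".txt"; otherwise append ".txt".
    $replaced = preg_replace('/\.doc$/i', '.txt', $docPath, 1, $count);
    return $count > 0 ? $replaced : $docPath . '.txt';
}

// Hypothetical wrapper around the COM approach shown above.
function readDocViaCom(string $docPath): string
{
    if (!class_exists('COM')) {
        throw new RuntimeException('The COM extension is not available.');
    }
    $txtPath = textOutputPath($docPath);
    $word = new COM('word.application');
    try {
        $word->Visible = 0;
        $document = $word->Documents->Open($docPath);
        // 2 = wdFormatText in Word's WdSaveFormat enumeration (plain text).
        $document->SaveAs($txtPath, 2);
        // Close the document without saving changes to the original.
        $document->Close(false);
        $document = null;
    } finally {
        // Always shut Word down, even if opening or saving failed.
        $word->Quit();
        $word = null;
    }
    if (!is_file($txtPath)) {
        return '';
    }
    $text = file_get_contents($txtPath);
    unlink($txtPath);
    return $text === false ? '' : $text;
}
```

Usage would look like `echo readDocViaCom("C:/xampp/htdocs/abcd.doc");`. The `finally` block is there so that a failed conversion does not leave a hidden Word process running.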