How can I read the contents of a docx file and strip all tags except the ones below?
Here is the code I got from other answers:
// FUNCTION :: read a docx file and return the string
// http://stackoverflow.com/questions/4587216/how-can-i-convert-a-docx-document-to-html-using-php
// https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            // (loadXML() is an instance method; calling it statically fails on PHP 8)
            $xml = new DOMDocument;
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            $xmldata = $xml->saveXML();
            // </w:p> is what word uses to mark the end of a paragraph. E.g.
            // <w:p>This is a paragraph.</w:p>
            // <w:p>And a second one.</w:p>
            // http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
            $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
            $xmldata = str_replace("<w:i/>", "<i>", $xmldata);
            $contents = explode('\n',strip_tags($xmldata, "<i>"));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
$filePath = 'sample.docx';
$string = readDocx($filePath);
var_dump($string);
So far I have only managed to keep the line breaks, but not the rest:
$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
$xmldata = str_replace("<w:i/>", "<i>", $xmldata); // will get <i>Hello World <-- no closing i
Any ideas?
EDIT:
$xmldata = preg_replace("/<w\:i\/>(.*?)<\/w\:r>/is", "<i>$1</i>", $xmldata);
$xmldata = preg_replace("/<w\:b\/>(.*?)<\/w\:r>/is", "<b>$1</b>", $xmldata);
$xmldata = preg_replace("/<w\:u (.*?)\/>(.*?)<\/w\:r>/is", "<u>$2</u>", $xmldata);
But the solution above has a flaw, for example:
<w:r><w:t xml:space="preserve"><w:i/>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> World</w:t></w:r>
You will notice that I am replacing everything from <w:i/> up to </w:r>, because <w:i/> is not paired.
Is there a better solution?
Answer 0 (score: 2)
I don't think the str_replace() and explode() calls are needed, so I would start with a single strip_tags():
$contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');
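At this point you can be sure that all the needed tags are preserved. As the next step, we should replace the <w:*> tags with their corresponding HTML tags: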
$contents = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $contents);
The only HTML tags we end up with are <p>, <b>, <i> and <u>, so capturing the tag name is as simple as using a dot in a capturing group:
(                 # (1 start)
    <             # Match XML opening tag character
    ( \/? )       # (2) Match if it is going to be an ending tag
    w:            # Literal `w:`
    ( . )         # (3) Match b,p,u,i
    [^>]* >       # Up to closing tag character
)                 # (1 end)
\1*               # Match if latter group repeats
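I had to check for repeated identical tags with \1*, because I found that they are quite likely to occur. Suppose our docx file contains three lines like these:
Bold
Italic
Normal
Then at this point our output looks something like: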
<p><b><b>Bold</p><p><i><i>Italic</p><p>Normal</p>
But as you can see, we have unpaired and duplicated tags, which is not good at all. We should clean up the markup. But how?
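The tag-renaming substitution can be tried in isolation on a small fragment. The sketch below uses a hypothetical input string, not real Word output. It also illustrates a detail worth checking: inside a double-quoted PHP string, "\1" is read as an octal escape (byte 0x01) before the pattern reaches the regex engine, so the backreference only takes effect when the pattern is single-quoted. That may be why duplicates such as <b><b> survive in the output above.

```php
<?php
// Hypothetical fragment: the same <w:b/> tag twice in a row inside a paragraph.
$input = '<w:p><w:b/><w:b/>Bold</w:p>';

// Double-quoted pattern, as in the answer: PHP turns "\1" into byte 0x01,
// so the regex engine never sees a backreference and nothing is collapsed.
$doubleQuoted = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $input);

// Single-quoted pattern: \1 reaches PCRE as a backreference to group 1,
// so an immediately repeated identical tag is consumed by the same match.
$singleQuoted = preg_replace('/(<(\/?)w:(.)[^>]*>)\1*/', '<$2$3>', $input);

echo $doubleQuoted, PHP_EOL; // <p><b><b>Bold</p>
echo $singleQuoted, PHP_EOL; // <p><b>Bold</p>
```

Either way the <b> tag is still unclosed, so the cleanup that follows is needed in both cases.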
The DOMDocument object
Although PHP Tidy is well suited to this kind of job, I found DOMDocument a better fit for our task:
$dom = new DOMDocument;
@$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$contents = $dom->saveHTML();
We set these two flags because we need neither the HTML DOCTYPE nor the <html> / <body> tags.
Our output at this point:
<p><b><b>Bold</b></b><p><i><i>Italic</i></i></p><p>Normal</p></p>
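The repair step can also be run on its own. This is a minimal sketch on a hypothetical one-paragraph fragment; it swaps the @ operator for libxml_use_internal_errors(), which is a different way to silence parser warnings and not what the answer itself uses. The exact serialization may vary between libxml versions, so no expected output is shown.

```php
<?php
// Hypothetical fragment with two unclosed <b> tags.
$contents = '<p><b><b>Bold</p>';

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// NOIMPLIED: do not add <html>/<body>; NODEFDTD: do not add a DOCTYPE.
$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serializing the parsed tree yields markup in which every <b> is closed.
$repaired = trim($dom->saveHTML());
echo $repaired, PHP_EOL;
```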
The good news is that all our tags are now paired; the bad news is that we have unnecessary duplicate tags:
<p><b><b>Bold</b></b><p><i><i>Italic</i></i></p><p>Normal</p></p>
      ^          ^         ^            ^
As a working solution for removing the extra tags, I wrote another regex:
$contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</?[ibu]>\s*)*?</?\2>)|<p></p>~s', "", $contents);
Here is what it does:
<                                  # Match an opening tag
( [ibu] )                          # (1) Any type except `p`
>                                  # Up to closing character
(?=                                # Which is immediately followed by
    (?: \s* < [ibu] > \s* )*?      # Other opening tags (or nothing)
    < \1 >                         # And then the same opening tag again
)                                  # End of lookahead
|                                  # Or match
</                                 # A closing tag
( [ibu] )                          # (2) Any type except `p`
>                                  # Up to closing character
(?=                                # Which is immediately followed by
    (?: \s* </ [ibu] > \s* )*?     # Other closing tags (or nothing)
    </? \2 >                       # And then the same tag again (the `/` is optional)
)                                  # End of lookahead
|                                  # Or match
<p></p>                            # Empty <p> tags
Now we have the correct output:
<p><b>Bold</b><p><i>Italic</i></p><p>Normal</p></p>
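To check the cleanup regex on something small, here is a sketch that applies it to a hypothetical, already-repaired fragment with one duplicated pair of <b> tags. The pattern is copied from the combined function below; only the input string is made up.

```php
<?php
// Hypothetical fragment: paired tags, but <b> and </b> are each duplicated.
$contents = '<p><b><b>Bold</b></b></p>';

$pattern = '~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)'   // opening tag followed by the same opening tag
         . '|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)' // closing tag followed by the same tag
         . '|<p></p>~s';                                // empty paragraph

// The first <b> and the first </b> each match a lookahead and are removed.
$cleaned = preg_replace($pattern, '', $contents);

echo $cleaned, PHP_EOL; // <p><b>Bold</b></p>
```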
Putting it all together:
<?php
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            $data = $zip->getFromIndex($index);
            $zip->close();
            $dom = new DOMDocument;
            $dom->loadXML($data, LIBXML_NOENT
                | LIBXML_XINCLUDE
                | LIBXML_NOERROR
                | LIBXML_NOWARNING);
            $xmldata = $dom->saveXML();
            $contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');
            $contents = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $contents);
            $dom = new DOMDocument;
            @$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
            $contents = $dom->saveHTML();
            $contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|<p></p>~s', "", $contents);
            return $contents;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
$filePath = 'sample.docx';
$string = readDocx($filePath);
echo $string;
Answer 1 (score: 1)
I came up with this solution. It's ugly, but it works:
$xmldata =
'<w:r>
<w:rPr>
<w:u/>
<w:b/>
<w:i/>
</w:rPr>
<w:t>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</w:t>
</w:r>';
// </w:p> is what word uses to mark the end of a paragraph. E.g.
// <w:p>This is a paragraph.</w:p>
// <w:p>And a second one.</w:p>
// http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
// http://officeopenxml.com/WPtext.php
$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
$xmldata = preg_replace("/<w\:i\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:i/>$1<w:t$2><i>$3</i></w:t>", $xmldata);
$xmldata = preg_replace("/<w\:b\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:b/>$1<w:t$2><b>$3</b></w:t>", $xmldata);
$xmldata = preg_replace("/<w\:u(.*?)\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:u$1/>$2<w:t$3><u>$4</u></w:t>", $xmldata);
Output:
<u><b><i>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</i></b></u>
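The output shown contains only HTML tags, which suggests a final strip_tags() pass that the snippet does not include; that step is an assumption here. A compact sketch of the whole sequence, on a hypothetical run with shortened text and no whitespace between elements:

```php
<?php
// Hypothetical run: underline, bold and italic all switched on.
$xmldata = '<w:r><w:rPr><w:u/><w:b/><w:i/></w:rPr><w:t>Hello</w:t></w:r>';

// Each pass finds a formatting toggle and wraps the text of the next <w:t>
// element in the matching HTML tag, leaving the Word markup in place.
$xmldata = preg_replace('/<w:i\/>(.*?)<w:t(.*?)>(.*?)<\/w:t>/is',
    '<w:i/>$1<w:t$2><i>$3</i></w:t>', $xmldata);
$xmldata = preg_replace('/<w:b\/>(.*?)<w:t(.*?)>(.*?)<\/w:t>/is',
    '<w:b/>$1<w:t$2><b>$3</b></w:t>', $xmldata);
$xmldata = preg_replace('/<w:u(.*?)\/>(.*?)<w:t(.*?)>(.*?)<\/w:t>/is',
    '<w:u$1/>$2<w:t$3><u>$4</u></w:t>', $xmldata);

// Assumed final step: drop every tag except the three HTML ones.
$html = strip_tags($xmldata, '<i><b><u>');

echo $html, PHP_EOL; // <u><b><i>Hello</i></b></u>
```

Because each pattern pairs a toggle with the next <w:t> it can find, a toggle in one run can still attach to text in a later run when its own run has no <w:t>.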
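Answer 2 (score: 0)
Stripping tags is not a good approach, because with your current solution the formatting is never closed. You should consider interpreting the XML instead.
The other codes you are looking for are <w:b/> (bold) and <w:u ...> (underline).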
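Interpreting the XML could look roughly like the sketch below, which walks paragraphs and runs with DOMXPath instead of rewriting the markup with regexes. The function name documentXmlToHtml and the W_NS constant are hypothetical. The sketch only handles <w:b>, <w:i> and <w:u> on runs that are direct children of a paragraph, and it treats a w:val of 0, false or none as "off". Runs nested in hyperlinks, as well as tabs, breaks and styles, are not covered.

```php
<?php
// WordprocessingML main namespace, bound to the "w" prefix in document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turns the XML of word/document.xml into minimal HTML.
function documentXmlToHtml(string $xml): string
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $html = '';
    foreach ($xpath->query('//w:p') as $paragraph) {
        $html .= '<p>';
        foreach ($xpath->query('w:r', $paragraph) as $run) {
            // Collect the text of the run and escape it for HTML.
            $text = '';
            foreach ($xpath->query('w:t', $run) as $t) {
                $text .= $t->textContent;
            }
            $text = htmlspecialchars($text);

            // Wrap the text once per active toggle, so every tag is closed.
            foreach (['i', 'b', 'u'] as $tag) {
                $prop = $xpath->query("w:rPr/w:$tag", $run)->item(0);
                if ($prop === null) {
                    continue;
                }
                $val = $prop->getAttributeNS(W_NS, 'val');
                if (!in_array($val, ['0', 'false', 'none'], true)) {
                    $text = "<$tag>$text</$tag>";
                }
            }
            $html .= $text;
        }
        $html .= '</p>';
    }
    return $html;
}

// Usage on a hypothetical two-run paragraph (first run italic, second plain).
$xml = '<w:document xmlns:w="' . W_NS . '"><w:body>'
     . '<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>Hello</w:t></w:r>'
     . '<w:r><w:t xml:space="preserve"> World</w:t></w:r></w:p>'
     . '</w:body></w:document>';

echo documentXmlToHtml($xml), PHP_EOL; // <p><i>Hello</i> World</p>
```

Since the formatting is read per run and the tags are emitted as a pair around that run's text, the unclosed <i> problem from the question does not arise.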