Question

How can I read the contents of a docx file and strip all tags except the ones for the following?

Bold
Italic
Underline
New line

Here is the code I put together from other answers:

//FUNCTION :: read a docx file and return the string
// http://stackoverflow.com/questions/4587216/how-can-i-convert-a-docx-document-to-html-using-php
// https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            $xmldata = $xml->saveXML();
            // </w:p> is what word uses to mark the end of a paragraph. E.g.
            // <w:p>This is a paragraph.</w:p>
            // <w:p>And a second one.</w:p>
            // http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
            $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
            $xmldata = str_replace("<w:i/>", "<i>", $xmldata);

            $contents = explode('\n',strip_tags($xmldata, "<i>"));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

$filePath = 'sample.docx';
$string = readDocx($filePath);
var_dump($string);

So far I have only managed to keep the line breaks, but not the rest:

$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
$xmldata = str_replace("<w:i/>", "<i>", $xmldata); // will get <i>Hello World <-- no closing i

Any ideas?

Edit

$xmldata = preg_replace("/<w\:i\/>(.*?)<\/w\:r>/is", "<i>$1</i>", $xmldata);
$xmldata = preg_replace("/<w\:b\/>(.*?)<\/w\:r>/is", "<b>$1</b>", $xmldata);
$xmldata = preg_replace("/<w\:u (.*?)\/>(.*?)<\/w\:r>/is", "<u>$2</u>", $xmldata);

But the solution above has a flaw, for example:

<w:r><w:t xml:space="preserve"><w:i/>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> World</w:t></w:r>

You'll notice that I am replacing everything from <w:i/> up to </w:r>, because <w:i/> has no closing tag to pair with.

Is there a better solution?

Answer 1

I don't think the str_replace() and explode() calls are needed, so I would start with a single strip_tags():

$contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');
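As a minimal sketch of what this call keeps, the snippet below runs strip_tags() on a hand-written WordprocessingML fragment (the sample markup and the variable names are assumptions for illustration, not taken from a real document). strip_tags() compares complete tag names, so <w:pPr> and <w:rPr> should be removed even though they begin like <w:p> and <w:r>, while the self-closing <w:b/> is treated like <w:b> and kept.

```php
<?php
// Hand-written sample paragraph; real documents carry many more attributes.
$sample = '<w:p><w:pPr><w:jc w:val="left"/></w:pPr>'
        . '<w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r></w:p>';

// Keep only the paragraph, underline, italic and bold tags.
$kept = strip_tags($sample, '<w:p><w:u><w:i><w:b>');

echo $kept, PHP_EOL;
```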

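Up to this point you can be sure that all the tags you need are preserved. As the next step, we should replace the <w:*> tags with their corresponding HTML tags: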

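// Single quotes keep \1 as a regex backreference; in double quotes PHP reads it as an octal escape
$contents = preg_replace('/(<(\/?)w:(.)[^>]*>)\1*/', '<$2$3>', $contents);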

We only have tags named <p>, <b>, <u>, and <i>, so capturing the name is as simple as putting a dot inside a capturing group:

 (               # (1 start)
      <             # Match XML opening tag character           
      ( \/? )       # (2) Match if it is going to be an ending tag
      w:            # Literal `w:`
      ( . )         # (3) Match b,p,u,i
      [^>]* >       # Up to closing tag character
 )               # (1 end)
 \1*             # Match if latter group repeats
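To make the effect of this substitution concrete, here is a small sketch run against a made-up input string (the input is an assumption standing in for what strip_tags() might leave behind). The pattern is single-quoted so that `\1` reaches the regex engine as a backreference; inside a double-quoted PHP string it would be parsed as an octal character escape.

```php
<?php
// Hypothetical input: three paragraphs, the first bold, the second italic.
$contents = '<w:p><w:b/>Bold</w:p><w:p><w:i/>Italic</w:p><w:p>Normal</w:p>';

// Rename every <w:x ...> or </w:x> to <x> or </x>, collapsing identical repeats.
$html = preg_replace('/(<(\/?)w:(.)[^>]*>)\1*/', '<$2$3>', $contents);

echo $html, PHP_EOL;
// <p><b>Bold</p><p><i>Italic</p><p>Normal</p>
```

Note that the <b> and <i> tags in the result are still unclosed, since the source markers were self-closing.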

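I had to check for identical consecutive tags with \1*, because I found they are quite likely to occur. Suppose our docx file contains three lines, the first in bold, the second in italics, and the third unformatted:

Bold

Italic

Normal

Then at this point our output looks something like: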

BoldItalicNormal

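But as you can see, we have unpaired tags, which is not good at all. We should clean up our markup. But how?

With the PHP Tidy extension

By loading our HTML into a DOMDocument object

Although PHP Tidy is well suited to this kind of job, I found DOMDocument a better fit for our task: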

$dom = new DOMDocument;
@$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$contents = $dom->saveHTML();

We set these two flags because we need neither the HTML DOCTYPE nor the <html> / <body> tags.
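A minimal sketch of this repair step on its own, assuming a fragment with one unclosed <b> (the input string is made up for illustration). It uses libxml_use_internal_errors() rather than the @ operator to silence parser warnings, which is a swapped-in choice, not what the answer does.

```php
<?php
// Hypothetical fragment with an unclosed <b>, as the previous step can produce.
$contents = '<p><b>Bold</p>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serializing the tree writes a closing tag for every element it contains.
$repaired = $dom->saveHTML();
echo $repaired;
```

One caveat worth checking on your own input: with LIBXML_HTML_NOIMPLIED, a fragment that has several top-level elements can come back with the later ones nested inside the first.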

Our output at this point:

BoldItalicNormal

The good news is that we now have closing tags; the bad news is that we may also have unnecessary opening tags:

BoldItalicNormal ^ ^ ^ ^

As a working solution for removing the extra opening tags, I wrote another RegEx:

$contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|~s', "", $contents);

Here is how it works:

 <                                # Match an opening tag
 ( [ibu] )                        # (1) Any type except `p`
 >                                # Up to closing character
 (?=                              # Which is followed by
     (?: \s* < [ibu] > \s* )*?    # Other opening tags (or nothing)
     < \1 >                       # And then the same opening tag again
 )                                # End of lookahead
 |                                # Or match
 </                               # A closing tag
 ( [ibu] )                        # (2) Any type except `p`
 >                                # Up to closing character
 (?=                              # Which is followed by
     (?: \s* </ [ibu] > \s* )*?   # Other closing tags (or nothing)
     </? \2 >                     # And then the same tag again (opening or closing)
 )                                # End of lookahead
 |                                # Or match nothing (empty alternative)
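The following sketch applies that pattern to a made-up string containing a doubled <b> pair (the input is an assumption chosen to exercise both alternatives). The first alternative drops the outer opening tag, the second drops the first of the two closing tags.

```php
<?php
// Hypothetical input with a duplicated <b> pair.
$contents = '<p><b><b>Bold</b></b></p>';

$pattern = '~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)'
         . '|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|~s';
$cleaned = preg_replace($pattern, '', $contents);

echo $cleaned, PHP_EOL;
// <p><b>Bold</b></p>
```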

Now we have the correct output:

BoldItalicNormal

Putting it all together:

<?php
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            $data = $zip->getFromIndex($index);
            $zip->close();
            $dom = new DOMDocument;
            $dom->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $xmldata = $dom->saveXML();
            // Keep only the paragraph, underline, italic and bold tags
            $contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');
            // Rename <w:x ...> to <x>; single quotes keep \1 as a backreference
            $contents = preg_replace('/(<(\/?)w:(.)[^>]*>)\1*/', '<$2$3>', $contents);
            // Let DOMDocument close the unpaired tags
            $dom = new DOMDocument;
            @$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
            $contents = $dom->saveHTML();
            // Remove the redundant opening and closing tags
            $contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|~s', "", $contents);
            return $contents;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

$filePath = 'sample.docx';
$string = readDocx($filePath);
echo $string;

Answer 2

I have this solution. It's ugly, but it works:

        $xmldata =
                    '<w:r>
        <w:rPr>
        <w:u/>
        <w:b/>
        <w:i/>
        </w:rPr>
        <w:t>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</w:t>
        </w:r>';
        // </w:p> is what word uses to mark the end of a paragraph. E.g.
        // <w:p>This is a paragraph.</w:p>
        // <w:p>And a second one.</w:p>
        // http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
        // http://officeopenxml.com/WPtext.php
        $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
        $xmldata = preg_replace("/<w\:i\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:i/>$1<w:t$2><i>$3</i></w:t>", $xmldata);
        $xmldata = preg_replace("/<w\:b\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:b/>$1<w:t$2><b>$3</b></w:t>", $xmldata);
        $xmldata = preg_replace("/<w\:u(.*?)\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:u$1/>$2<w:t$3><u>$4</u></w:t>", $xmldata);

Output:

<u><b><i>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</i></b></u>

Answer 3

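Stripping tags is not a good approach, because with your current solution the formatting is never closed. You should consider parsing the XML instead.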

The other markers you are looking for are <w:b/> (bold) and <w:u ...> (underline).
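One way to read "parse the XML instead" is sketched below: walk the paragraphs and runs with DOMXPath and wrap each run's text according to the markers in its <w:rPr>. The function name docxXmlToHtml() and the sample document are hypothetical, and the sketch makes simplifying assumptions: a marker counts as "on" whenever it is present (attributes such as w:val="0" or w:val="none" are ignored), and every <w:p> found anywhere in the document becomes one output line.

```php
<?php
// Hypothetical helper (not from the answers above): turn the XML of
// word/document.xml into text with <b>, <i>, <u> tags and line breaks.
function docxXmlToHtml(string $xml): string
{
    $dom = new DOMDocument();
    if ($xml === '' || !$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $html = '';
    foreach ($xpath->query('//w:p') as $paragraph) {
        foreach ($xpath->query('.//w:r', $paragraph) as $run) {
            // Collect the run's text; <w:br/> inside a run becomes <br>.
            $text = '';
            foreach ($xpath->query('w:t | w:br', $run) as $node) {
                $text .= $node->localName === 'br'
                    ? '<br>'
                    : htmlspecialchars($node->textContent);
            }
            if ($text === '') {
                continue;
            }
            // Simplification: a marker's presence means "on"; w:val is ignored.
            foreach (['i', 'b', 'u'] as $tag) {
                if ($xpath->query("w:rPr/w:$tag", $run)->length > 0) {
                    $text = "<$tag>$text</$tag>";
                }
            }
            $html .= $text;
        }
        $html .= "\n"; // one output line per paragraph
    }
    return $html;
}

// Hand-written sample document for illustration.
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
        . '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r></w:p>'
        . '<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>Italic</w:t></w:r>'
        . '<w:r><w:t xml:space="preserve"> World</w:t></w:r></w:p>'
        . '</w:body></w:document>';

$result = docxXmlToHtml($sample);
echo $result;
```

Because every tag is opened and closed around a single run, the output stays balanced even when only part of a paragraph is formatted, which is the case the regex approaches struggle with.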

PHP: read docx file contents but keep line breaks, italics, underline, and bold?
