PHP: read docx file contents while keeping line breaks, italics, underline and bold?

Time: 2016-07-15 10:06:06

Tags: php domdocument docx strip-tags

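How can I read the contents of a docx file and strip all tags except for the following?

  1. Bold
  2. Italic
  3. Underline
  4. New lines

Here is the code I put together from other answers: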

    //FUNCTION :: read a docx file and return the string
    // http://stackoverflow.com/questions/4587216/how-can-i-convert-a-docx-document-to-html-using-php
    // https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
    function readDocx($filePath) {
        // Create new ZIP archive
        $zip = new ZipArchive;
        $dataFile = 'word/document.xml';
        // Open received archive file
        if (true === $zip->open($filePath)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();
                // Load XML from a string
                // Skip errors and warnings
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Return data without XML formatting tags
                $xmldata = $xml->saveXML();
                // </w:p> is what word uses to mark the end of a paragraph. E.g.
                // <w:p>This is a paragraph.</w:p>
                // <w:p>And a second one.</w:p>
                // http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
                $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
                $xmldata = str_replace("<w:i/>", "<i>", $xmldata);
    
                $contents = explode('\n',strip_tags($xmldata, "<i>"));
                $text = '';
                foreach($contents as $i=>$content) {
                    $text .= $contents[$i];
                }
                return $text;
            }
            $zip->close();
        }
        // In case of failure return empty string
        return "";
    }
    
    $filePath = 'sample.docx';
    $string = readDocx($filePath);
    var_dump($string);
    

So far I have only managed to keep the line breaks, but not the rest:

    $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
    $xmldata = str_replace("<w:i/>", "<i>", $xmldata); // will get <i>Hello World <-- no closing i
    

Any ideas?

Edit

    $xmldata = preg_replace("/<w\:i\/>(.*?)<\/w\:r>/is", "<i>$1</i>", $xmldata);
    $xmldata = preg_replace("/<w\:b\/>(.*?)<\/w\:r>/is", "<b>$1</b>", $xmldata);
    $xmldata = preg_replace("/<w\:u (.*?)\/>(.*?)<\/w\:r>/is", "<u>$2</u>", $xmldata);
    

But the solution above has a flaw, for example:

    <w:r><w:t xml:space="preserve"><w:i/>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> World</w:t></w:r>
    

Note that I am matching from <w:i/> up to </w:r>, because <w:i/> has no closing counterpart of its own.

Is there a better solution?

3 Answers:

Answer 0 (score: 2)

I don't think the str_replace() and explode() calls are needed, so I would simply use strip_tags():

$contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');

At this point you can be sure that all the tags you need are preserved. As a next step, we should replace the <w:*> tags with the corresponding HTML tags:

$contents = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $contents);

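Since the only HTML tags we end up with are <p>, <b>, <i> and <u>, capturing the tag name is as simple as capturing a single character with a dot in a capturing group: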

 (               # (1 start)
      <             # Match XML opening tag character           
      ( \/? )       # (2) Match if it is going to be an ending tag
      w:            # Literal `w:`
      ( . )         # (3) Match b,p,u,i
      [^>]* >       # Up to closing tag character
 )               # (1 end)
 \1*             # Match if latter group repeats 
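
As a quick check of this pattern, here is a minimal sketch on a made-up fragment (the input string is hypothetical, not taken from a real document). It also compares the two quoting styles: inside a double-quoted PHP string, `\1` is read as an octal character escape, so the backreference only reaches the regex engine when the pattern is single-quoted (or written as `\\1`).

```php
<?php
// Hypothetical input: roughly what strip_tags() might leave for a bold paragraph.
$fragment = '<w:p w:rsidR="00A1"><w:b/><w:b/>Bold</w:p>';

// Single quotes: \1 stays a regex backreference, so the repeated tag collapses.
$single = preg_replace('/(<(\/?)w:(.)[^>]*>)\1*/', '<$2$3>', $fragment);

// Double quotes: "\1" becomes the character chr(1) before PCRE sees the pattern,
// so each <w:b/> is replaced on its own.
$double = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $fragment);

echo $single, PHP_EOL; // <p><b>Bold</p>
echo $double, PHP_EOL; // <p><b><b>Bold</p>
```

Either way the result still has unclosed tags, which is what the following steps deal with.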

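I had to check for repeats of the same matched tag with \1*, because I found they are quite likely to occur. Suppose our docx file contains three lines like these:

Bold

Italic

Normal

Then at this point our output looks something like: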

<p><b><b>Bold</p><p><i><i>Italic</p><p>Normal</p>

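But as you can see, we have unpaired and duplicated tags, which is not good at all. We should clean up our output. But how?

  1. With the PHP Tidy extension
  2. By loading our HTML into a DOMDocument object

Although PHP Tidy is well suited to this kind of work, I found DOMDocument a better fit for our task: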

    $dom = new DOMDocument;
    @$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED  | LIBXML_HTML_NODEFDTD);
    $contents = $dom->saveHTML();
    

We set the two flags because we need neither the HTML DOCTYPE nor the <html> / <body> tags.
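
A variation on the call above, as a sketch: instead of silencing the parser with `@`, libxml's internal error handling can be switched on around the call. The helper name `balanceTags()` is hypothetical; the two flags are the same ones used above.

```php
<?php
// Hypothetical helper: lets DOMDocument close any tags left open in $html.
function balanceTags(string $html): string
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (string) $dom->saveHTML();
}

$fixed = balanceTags('<p><b>Bold</p>');
echo $fixed;
```

The exact markup produced depends on how libxml repairs the input, so it is worth inspecting the result rather than assuming a particular nesting.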

Our output at this point:

    <p><b><b>Bold</b></b><p><i><i>Italic</i></i></p><p>Normal</p></p>
    

The good news is that our tags are now closed; the bad news is that we still have unnecessary opening tags:

    <p><b><b>Bold</b></b><p><i><i>Italic</i></i></p><p>Normal</p></p>
       ^  ^                 ^  ^
    

As a working solution for removing the extra opening tags, I wrote another RegEx:

    $contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|<p></p>~s', "", $contents);
    

Here is what it does:

     <                                  # Match an opening tag
     ( [ibu] )                          # (1) Any type except `p`
     >                                  # Up to closing character
     (?=                                # Which is immediately followed by
          (?: \s* < [ibu] > \s* )*?     # Another opening tag (or nothing)
          < \1 >                        # And then the same opening tag again
     )                                  # End of lookahead
     |                                  # Or match
     </                                 # A closing tag
     ( [ibu] )                          # (2) Any type except `p`
     >                                  # Up to closing character
     (?=                                # Which is immediately followed by
          (?: \s* </ [ibu] > \s* )*?    # Another closing tag (or nothing)
          </? \2 >                      # And then the same tag, closing or opening
     )                                  # End of lookahead
     |                                  # Or match
     <p></p>                            # Empty <p> tags
    

Now we have the correct output:

    <p><b>Bold</b><p><i>Italic</i></p><p>Normal</p></p>
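
To make the effect easy to reproduce, the cleanup step can be tried in isolation. This is a small sketch that feeds the intermediate output shown earlier through the pattern; the variable names are only for illustration.

```php
<?php
// The cleanup pattern, in the form used in the explanation above.
$pattern = '~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)'
         . '|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)'
         . '|<p></p>~s';

// Intermediate output after the DOMDocument round trip.
$before = '<p><b><b>Bold</b></b><p><i><i>Italic</i></i></p><p>Normal</p></p>';
$after  = preg_replace($pattern, '', $before);
echo $after, PHP_EOL; // <p><b>Bold</b><p><i>Italic</i></p><p>Normal</p></p>

// Two adjacent runs with the same formatting: the first </b> is dropped too.
$adjacent = preg_replace($pattern, '', '<b>a</b><b>b</b>');
echo $adjacent, PHP_EOL; // <b>a<b>b</b>
```

The second case suggests the pattern is best treated as a heuristic: when neighbouring runs share the same formatting, it may be worth passing the result through DOMDocument once more.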
    

Putting it all together:

    <?php
    
    function readDocx($filePath) {
        // Create new ZIP archive
        $zip = new ZipArchive;
        $dataFile = 'word/document.xml';
        // Open received archive file
        if (true === $zip->open($filePath)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                $data = $zip->getFromIndex($index);
                $zip->close();
    
                $dom = new DOMDocument;
                $dom->loadXML($data, LIBXML_NOENT
                    | LIBXML_XINCLUDE
                    | LIBXML_NOERROR
                    | LIBXML_NOWARNING);
    
                $xmldata = $dom->saveXML();
    
                $contents = strip_tags($xmldata, '<w:p><w:u><w:i><w:b>');
                $contents = preg_replace("/(<(\/?)w:(.)[^>]*>)\1*/", "<$2$3>", $contents);
    
                $dom = new DOMDocument;
                @$dom->loadHTML($contents, LIBXML_HTML_NOIMPLIED  | LIBXML_HTML_NODEFDTD);
                $contents = $dom->saveHTML();
    
                $contents = preg_replace('~<([ibu])>(?=(?:\s*<[ibu]>\s*)*?<\1>)|</([ibu])>(?=(?:\s*</[ibu]>\s*)*?</?\2>)|<p></p>~s', "", $contents);
    
                return $contents;
            }
            $zip->close();
        }
        // In case of failure return empty string
        return "";
    }
    
    $filePath = 'sample.docx';
    $string = readDocx($filePath);
    echo $string;
    

Answer 1 (score: 1)

I have this solution. It's ugly, but it works:

        $xmldata =
                    '<w:r>
        <w:rPr>
        <w:u/>
        <w:b/>
        <w:i/>
        </w:rPr>
        <w:t>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</w:t>
        </w:r>';
        // </w:p> is what word uses to mark the end of a paragraph. E.g.
        // <w:p>This is a paragraph.</w:p>
        // <w:p>And a second one.</w:p>
        // http://stackoverflow.com/questions/5607594/find-linebreaks-in-a-docx-file-using-php
        // http://officeopenxml.com/WPtext.php
        $xmldata = str_replace("</w:p>", "\r\n", $xmldata);
        $xmldata = preg_replace("/<w\:i\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:i/>$1<w:t$2><i>$3</i></w:t>", $xmldata);
        $xmldata = preg_replace("/<w\:b\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:b/>$1<w:t$2><b>$3</b></w:t>", $xmldata);
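        $xmldata = preg_replace("/<w\:u(.*?)\/>(.*?)<w:t(.*?)>(.*?)<\/w\:t>/is", "<w:u$1/>$2<w:t$3><u>$4</u></w:t>", $xmldata);
        // Remove the remaining w:* tags, keeping only the HTML formatting tags
        echo trim(strip_tags($xmldata, '<u><b><i>'));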

Output:

<u><b><i>I feel that there is much to be said for the Celtic belief that the souls of those whom we have lost are held captive in some inferior being...</i></b></u>

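Answer 2 (score: 0)

Stripping tags is not a good approach because, as with your current solution, the formatting never gets closed. You should consider interpreting the XML instead.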

The other tags you are looking for are <w:b/> (bold) and <w:u ...> (underline).
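
To illustrate what interpreting the XML could look like, here is a minimal sketch using DOMDocument and DOMXPath. The function names `docxXmlToHtml()` and `runHasToggle()` are hypothetical. The sketch assumes formatting is set directly in each run's `<w:rPr>`; it ignores formatting inherited from styles and runs nested inside other elements such as hyperlinks.

```php
<?php
// Sketch only: the helper names are hypothetical, not from the answers above.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// True when the run's <w:rPr> switches the given property (b, i or u) on.
function runHasToggle(DOMXPath $xpath, DOMElement $run, string $name): bool
{
    $node = $xpath->query("./w:rPr/w:$name", $run)->item(0);
    if (!$node instanceof DOMElement) {
        return false;
    }
    // A missing w:val means "on"; these values switch the property off.
    $val = $node->getAttributeNS(W_NS, 'val');
    return !in_array($val, ['0', 'false', 'off', 'none'], true);
}

// Takes the contents of word/document.xml and returns text with <b>/<i>/<u>.
function docxXmlToHtml(string $documentXml): string
{
    $dom = new DOMDocument();
    if ($documentXml === '' || !@$dom->loadXML($documentXml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        foreach ($xpath->query('./w:r', $paragraph) as $run) {
            // Collect the run's text; <w:br/> becomes a line break.
            $text = '';
            foreach ($run->childNodes as $child) {
                if ($child->namespaceURI !== W_NS) {
                    continue;
                }
                if ($child->localName === 't') {
                    $text .= htmlspecialchars($child->textContent, ENT_NOQUOTES);
                } elseif ($child->localName === 'br') {
                    $text .= "\n";
                }
            }
            if ($text === '') {
                continue;
            }
            // Wrap the text so every opening tag gets its closing tag.
            foreach (['i', 'b', 'u'] as $tag) {
                if (runHasToggle($xpath, $run, $tag)) {
                    $text = "<$tag>$text</$tag>";
                }
            }
            $line .= $text;
        }
        $lines[] = $line;
    }
    // One output line per paragraph.
    return implode("\n", $lines);
}
```

With a real file, the string read from `word/document.xml` via ZipArchive (as in the question's `readDocx()`) would be passed to `docxXmlToHtml()`. Because each run is wrapped as a unit, the tags come out paired without any later cleanup step.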