Question

I have been using a very useful tool for reading submitted Word documents, taken from the accepted answer to: How to extract text from word file .doc,docx,.xlsx,.pptx php

It works very well, except that it sometimes omits the first few lines of text from .doc files.

This is the function that reads .doc files:

private function read_doc() {
    $fileHandle = fopen($this->filename, "r");
    $line = @fread($fileHandle, filesize($this->filename));   
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
      {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
          {
          } else {
            $outtext .= $thisline." ";
          }
      }
     $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}

The problem seems to be in this part:

$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))

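While this correctly removes the parts of the document that contain no text content, it also seems to be responsible for sometimes removing the first line of text content.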
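A minimal sketch of how this can happen, using a synthetic string in place of a real file (the byte layout is made up for illustration and is not taken from an actual .doc). When the last chunk of binary data and the first line of text fall into the same 0x0D-delimited segment, the null-byte test discards both:

```php
<?php
// Synthetic stand-in for raw .doc bytes: a few binary bytes, then text
// lines separated by 0x0D. This layout is an assumption for illustration.
$raw = "\x01\x00\x02\x00First line\rSecond line\rThird line\r\x00\x00";

$kept = [];
foreach (explode(chr(0x0D), $raw) as $segment) {
    // Same test as in read_doc(): skip segments that contain a null byte
    // or that are empty.
    if (strpos($segment, chr(0x00)) !== false || strlen($segment) == 0) {
        continue;
    }
    $kept[] = $segment;
}

// "First line" shares a segment with the binary bytes, so it is skipped.
print_r($kept);
```

Only "Second line" and "Third line" survive, which matches the symptom described above.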

How can I modify this function to avoid this problem when reading .doc files?

Answer 1

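I came up with the following workaround, which seems to solve the problem. I use strrpos instead of strpos to find the last occurrence of the 0x00 character in a line, because whatever follows it in that line is text content. If that line is the last piece of document encoding before the content starts, the text portion of the line is added to the output.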

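private function read_doc() {
    // Open in binary mode, as the PHP manual recommends for portability.
    $fileHandle = fopen($this->filename, "rb");
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    $content_started = false;
    // Initialise so the first iteration never reads undefined variables.
    $lastline = "";
    $lastpos = false;
    foreach ($lines as $thisline) {
        // Position of the last null byte in this line, or false if there is none.
        $pos = strrpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            if (!$content_started && $lastpos !== false) {
                // First text line: recover the text that followed the last
                // null byte of the preceding line.
                $outtext .= substr($lastline, $lastpos) . " ";
            }
            $content_started = true;
            $outtext .= $thisline . " ";
        }
        $lastline = $thisline;
        $lastpos = $pos;
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}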
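The same idea can be tried outside the class as a free function that takes the raw bytes as a string, which makes it easy to check on synthetic input. This is only a sketch: the name extract_doc_text and the sample bytes are hypothetical, the final preg_replace clean-up is left out, and starting the substr at $lastPos + 1 (to skip the null byte itself) is a small variation on the method above.

```php
<?php
// Hypothetical helper: same approach as read_doc(), but operating on a string.
function extract_doc_text(string $raw): string
{
    $out = "";
    $contentStarted = false;
    $lastSegment = "";
    $lastPos = false;

    foreach (explode(chr(0x0D), $raw) as $segment) {
        // Last null byte in this segment, or false if it has none.
        $pos = strrpos($segment, chr(0x00));
        if ($pos === false && strlen($segment) > 0) {
            if (!$contentStarted && $lastPos !== false) {
                // Recover the text after the last null byte of the previous segment.
                $out .= substr($lastSegment, $lastPos + 1) . " ";
            }
            $contentStarted = true;
            $out .= $segment . " ";
        }
        $lastSegment = $segment;
        $lastPos = $pos;
    }
    return $out;
}

// Made-up bytes: binary data, then text lines separated by 0x0D.
$raw = "\x01\x00\x02\x00First line\rSecond line\rThird line\r\x00\x00";
echo extract_doc_text($raw);
```

On this input the first line is recovered along with the other two. Real .doc files are more varied, so the result on actual documents may differ.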
