How do I convert doc and docx files to plain text in PHP?

Time: 2018-06-28 12:01:25

Tags: php

Code:

<?php
    function parseWord($userDoc) 
    {
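        // Open in binary mode so the bytes are read unmodified, and close the handle afterwards.
        $fileHandle = fopen($userDoc, "rb");
        $word_text = @fread($fileHandle, filesize($userDoc));
        fclose($fileHandle);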
        $line = "";
        $tam = filesize($userDoc);
        $nulos = 0;
        $caracteres = 0;
        for($i=1536; $i<$tam; $i++)
        {
            $line .= $word_text[$i];
            // Compare against a NUL byte; a loose comparison with the integer 0 does not test for one.
            if( $word_text[$i] === "\0")
            {
                $nulos++;
            }
            else
            {
                $nulos=0;
                $caracteres++;
            }

            if( $nulos>1996)
            {   
                break;  
            }
        }
        $lines = explode(chr(0x0D),$line);
        $outtext = "";
        foreach($lines as $thisline)
        {
            $tam = strlen($thisline);
            if( !$tam )
            {
                continue;
            }
            $new_line = ""; 
            for($i=0; $i<$tam; $i++)
            {
                $onechar = $thisline[$i];
                if( $onechar > chr(240) )
                {
                    continue;
                }

                if( $onechar >= chr(0x20) )
                {
                    $caracteres++;
                    $new_line .= $onechar;
                }

                if( $onechar == chr(0x14) )
                {
                    $new_line .= "</a>";
                }
                if( $onechar == chr(0x07) )
                {
                    $new_line .= "\t";
                    if( isset($thisline[$i+1]) )
                    {
                        if( $thisline[$i+1] == chr(0x07) )
                        {
                            $new_line .= "\n";
                        }
                    }
                }
            }
            $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
            $new_line = str_replace("\o" ,">",$new_line); 
            $new_line .= "\n";
            $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
            $new_line = str_replace("\*" ,"><br>",$new_line); 
            $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 
            $outtext .= nl2br($new_line);
        }
        return $outtext;
    } 
    $userDoc = "upload_resume/".$upload_resume;
    $text = parseWord($userDoc);
    echo $text;
?>

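I currently upload only doc and docx files to my upload_resume folder. I want to display a doc or docx file, or convert it to plain text, using this function, parseWord. I just read the file and print the text, but it does not come out as plain text. My output looks like this: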

... JA.a.} 7。“。H.w”넙.w̤ھ���P�^���O֛���;。“f3��\�ȾT��IS��̌����W����Y

I don't know where the problem is. How can I fix this? Please help me.

Thanks

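1 answer:

Answer 0 (score: 0)

A Word file is not a simple text document that you can open and read byte by byte. A .doc file is a binary format, and a .docx file is a ZIP archive of XML parts, so printing the raw bytes produces the garbage shown in the question. There are libraries that handle these formats in PHP, for example: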

https://github.com/PHPOffice/PHPWord
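
For the .docx case specifically, the idea can be illustrated without any library. The following is only a minimal sketch, not PHPWord's API: it assumes PHP's zip extension is enabled, and docxToText is a hypothetical helper name introduced here. It reads word/document.xml from the archive and strips the markup, so it ignores headers, footers, footnotes and tables layout, and it does not handle legacy .doc files at all.

```php
<?php
// Hypothetical helper (not part of the original post or of PHPWord):
// extracts the body text of a .docx file, which is a ZIP archive whose
// main document part is stored as word/document.xml.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }

    // Map paragraph ends, simple line breaks and tabs to plain-text equivalents
    // before the tags are removed.
    $xml = str_replace(['</w:p>', '<w:br/>', '<w:tab/>'], ["\n", "\n", "\t"], $xml);

    // Remove the remaining markup and decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    return trim($text);
}
```

With the variables from the question, a call might look like echo nl2br(htmlspecialchars(docxToText("upload_resume/" . $upload_resume))); so that the extracted text is escaped before it is shown in a page. For anything beyond a quick extraction, or for .doc input, a library such as the one linked above is the safer route.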