Question

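I have been working on a text extraction project covering various file extensions, but the most painful ones are PDF and PowerPoint. Here is my code for PDF. Does anyone here know how to read text from an existing PDF document using any tool or library such as TCPDF, Xpdf or FPDI? I have not yet seen an exact solution for reading text from PDF or PPT, but please do not use solutions ...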

function pdf2txt($filename){

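    $data = getFileData($filename);
    $j = 0;               // chunk counter
    $a_chunks = array();  // stream chunks collected from the PDF objects
    $result_data = "";    // accumulated plain text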

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
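    foreach((array)$a_obj as $obj){ // cast guards against null when no objects were found

        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_chunks[$j]["data"] = ""; // stays empty when the object has no stream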

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],strlen("stream\r\n"),strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = explode("/",$chunk["filter"]); // split() was removed in PHP 7

        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            if (strpos($chunk["filter"],"FlateDecode")!==false){ // strpos, not substr, tests for a substring
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {

                    //$result_data .= "x";
                }
            }
        }
    }

    return $result_data;

}


// Function    : ps2txt()
// Arguments   : $ps_data - postscript data you want to convert to plain text
// Description : Does a very basic parse of postscript data to
//             :  return the plain text
// Author      : Jonathan Beckett, 2005-05-02
function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}


// Function    : getFileData()
// Arguments   : $filename - filename you want to load
// Description : Reads data from a file into a variable
//               and passes that data back
// Author      : Jonathan Beckett, 2005-05-02
function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}


// Function    : getDataArray()
// Arguments   : $data       - data you want to chop up
//               $start_word - delimiting characters at start of each chunk
//               $end_word   - delimiting characters at end of each chunk
// Description : Loop through an array of data and put all chunks
//               between start_word and end_word in an array
// Author      : Jonathan Beckett, 2005-05-02
function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = null; // stays null when no chunk is found, so callers' is_array() checks still work

    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
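
The FlateDecode step in pdf2txt() can be checked in isolation. The following is a minimal sketch, not a full PDF parser: it assumes the stream holds zlib-compressed content with literal strings in parentheses, and the helper name extractLiteralStrings() is made up for this illustration. Real PDFs also use hex strings, custom font encodings and other filters, which this does not handle.

```php
<?php
// Hypothetical helper: pulls literal strings "( ... )" out of a decoded
// content stream. Handles backslash-escaped parentheses and backslashes only.
function extractLiteralStrings(string $content): string
{
    // Match "(" ... ")" where the body is either an escaped character
    // or any character that is not a backslash or a parenthesis.
    preg_match_all('/\((?:\\\\.|[^\\\\()])*\)/s', $content, $matches);
    $text = '';
    foreach ($matches[0] as $literal) {
        $inner = substr($literal, 1, -1);                         // drop the outer parentheses
        $text .= preg_replace('/\\\\([()\\\\])/', '$1', $inner);  // unescape \( \) \\
    }
    return $text;
}

// Simulate a FlateDecode stream: gzcompress() writes the zlib format
// that gzuncompress() reads back.
$content = "BT /F1 12 Tf (Hello ) Tj (World \\(PDF\\)) Tj ET";
$stream  = gzcompress($content);
$decoded = gzuncompress($stream);
$text    = extractLiteralStrings($decoded);
```

If gzuncompress() returns false on a real stream, the chunk was probably cut at the wrong offsets, for example because the file uses "stream\n" rather than "stream\r\n".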
This one is for PowerPoint. I found it somewhere here, but it isn't working either:
function parsePPT($filename) {
    // This approach looks for the byte pattern chr(0x0f).Hex_value.chr(0x00).chr(0x00).chr(0x00)
    // to find text strings, which are then terminated by another NUL, chr(0x00).
    $fileHandle = fopen($filename, "rb"); // binary-safe read
    $line = @fread($fileHandle, filesize($filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0f), $line);
    $outtext = '';

    foreach ($lines as $thisline) {
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) === 1) {
            $text_line = substr($thisline, 4);
            $end_pos   = strpos($text_line, chr(0x00));
            if ($end_pos === false) {
                continue; // no terminating NUL, so there is no text to take
            }
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $text_line);
            // skip master-slide placeholders and one-character fragments
            if (substr($text_line, 0, 20) != "Click to edit Master" && strlen($text_line) > 1) {
                $outtext .= $text_line."\n<br>";
            }
        }
    }
    return $outtext;
}
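
The byte pattern that parsePPT() searches for looks like the tail of a record header followed by a small length field. A possible variant, sketched below, reads the length field instead of stopping at the first NUL. It rests on assumptions that are not confirmed by the post: that 8-bit slide text sits in records with an 8-byte header (2 bytes version/instance set to zero, 2 bytes type 0x0FA8 little-endian, 4 bytes little-endian length), and that the record is stored contiguously in the file. The name scanTextBytesAtoms() is hypothetical.

```php
<?php
// Hypothetical helper: scans raw bytes for records with the assumed layout
//   00 00 | A8 0F | <4-byte little-endian length> | <length bytes of 8-bit text>
function scanTextBytesAtoms(string $bytes): array
{
    $found  = [];
    $marker = "\x00\x00\xA8\x0F"; // assumed version/instance + record type
    $total  = strlen($bytes);
    $offset = 0;

    while (($pos = strpos($bytes, $marker, $offset)) !== false) {
        $offset = $pos + 1;                 // default: resume just past this hit
        if ($pos + 8 > $total) {
            break;                          // header is truncated
        }
        $len = unpack('V', substr($bytes, $pos + 4, 4))[1];
        if ($len === 0 || $pos + 8 + $len > $total) {
            continue;                       // implausible length, treat as a false hit
        }
        $found[] = substr($bytes, $pos + 8, $len);
        $offset  = $pos + 8 + $len;         // jump past the record body
    }
    return $found;
}

// Synthetic input, not a real .ppt file.
$sample = "junk" . "\x00\x00\xA8\x0F" . pack('V', 5) . "Hello" . "more";
$atoms  = scanTextBytesAtoms($sample);
```

Scanning raw bytes remains a heuristic either way; text stored as 16-bit characters, or split across storage sectors, would be missed.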

Answer 1

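Why reinvent the wheel? You can use, e.g., xpdf or a similar tool to extract the text data from the PDF, and then process the plain-text file that this produces. The same approach works for almost any file format that contains text (i.e. convert to a plain-text version first, then process it)...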

If you go with that solution, Indexing PDF Documents with Zend_Search_Lucene might be an interesting read.
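
The convert-first approach could look roughly like this from PHP. It is only a sketch: it assumes the pdftotext binary that ships with xpdf (or poppler-utils) is installed and on the PATH, that exec() is enabled, and that the shell is Unix-like (for the 2>/dev/null redirect). Both function names are hypothetical.

```php
<?php
// Hypothetical helper: builds the shell command for pdftotext.
// "-" as the output file tells pdftotext to write the text to stdout.
function buildPdftotextCommand(string $pdfPath): string
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null on failure.
function pdfToTextViaCli(string $pdfPath): ?string
{
    if (!is_readable($pdfPath)) {
        return null; // missing or unreadable file
    }
    $output = [];
    $status = 0;
    exec(buildPdftotextCommand($pdfPath) . ' 2>/dev/null', $output, $status);
    // A non-zero exit status means pdftotext failed or is not installed.
    return $status === 0 ? implode("\n", $output) : null;
}
```

The returned string is ordinary text, so counting words or feeding an indexer works the same as for any other plain-text input, for example str_word_count(pdfToTextViaCli('document.pdf') ?? '').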
