Another question about PDF to text... (any server-side solution!)

Asked: 2011-05-27 23:25:25

Tags: php pdf text

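I have been struggling to find a way to convert PDF documents to text.

The solution below has worked best so far, but it does not work for every PDF. The files are all: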

%PDF-1.4
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>

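I need to do this server-side, and I cannot install modules. Formatting the string that the code outputs is not a problem for me. I am having a hard time working out the rest.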

function pdf2string($sourcefile) { 

$fp = fopen($sourcefile, 'rb'); 
$content = fread($fp, filesize($sourcefile)); 
fclose($fp); 

$searchstart = 'stream'; 
$searchend = 'endstream'; 
$pdfText = ''; 
$pos = 0; 
$pos2 = 0; 
$startpos = 0; 

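while (($pos = strpos($content, $searchstart, $startpos)) !== false) {

    // The stream data starts after the "stream" keyword and its line ending.
    $datastart = $pos + strlen($searchstart);
    if (substr($content, $datastart, 2) === "\r\n") {
        $datastart += 2;
    } else if (substr($content, $datastart, 1) === "\n") {
        $datastart++;
    }

    // Look for "endstream" only after the start of the data.
    $pos2 = strpos($content, $searchend, $datastart);
    if ($pos2 === false) {
        break;
    }

    // Drop the line ending that separates the data from "endstream".
    $dataend = $pos2;
    if (substr($content, $dataend - 2, 2) === "\r\n") {
        $dataend -= 2;
    } else if (substr($content, $dataend - 1, 1) === "\n") {
        $dataend--;
    }

    $textsection = substr($content, $datastart, max(0, $dataend - $datastart));

    // Only FlateDecode (zlib) streams can be inflated here; for anything
    // else gzuncompress() returns false and pdfExtractText() returns ''.
    $data = @gzuncompress($textsection);
    $pdfText .= "<br>" . pdfExtractText($data);
    $startpos = $pos2 + strlen($searchend);

}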

return preg_replace('/(\s)+/', ' ', $pdfText); 

} 

function pdfExtractText($psData){ 

if (!is_string($psData)) { 
    return ''; 
} 

$text = ''; 

// Handle brackets in the text stream that could be mistaken for 
// the end of a text field. I'm sure you can do this as part of the 
// regular expression, but my skills aren't good enough yet. 
$psData = str_replace('\)', '##ENDBRACKET##', $psData); 
$psData = str_replace('\]', '##ENDSBRACKET##', $psData); 

preg_match_all( 
    '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si', 
    $psData, 
    $matches 
); 
for ($i = 0; $i < sizeof($matches[0]); $i++) { 
    if ($matches[3][$i] != '') { 
        // Run another match over the contents. 
        preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches); 
        foreach ($subMatches[1] as $subMatch) { 
            $text .= $subMatch; 
        } 
    } else if ($matches[4][$i] != '') { 
        $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i]; 
    } 
} 

// Translate special characters and put back brackets. 
$trans = array( 
    '...'                => '…', 
    '\205'                => '…', 
    '\221'                => chr(145), 
    '\222'                => chr(146), 
    '\223'                => chr(147), 
    '\224'                => chr(148), 
    '\226'                => '-', 
    '\267'                => '•', 
    '\('                => '(', 
    '\['                => '[', 
    '##ENDBRACKET##'    => ')', 
    '##ENDSBRACKET##'    => ']', 
    chr(133)            => '-', 
    chr(141)            => chr(147), 
    chr(142)            => chr(148), 
    chr(143)            => chr(145), 
    chr(144)            => chr(146), 
); 
$text = strtr($text, $trans); 

return $text; 

} 
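
Since every file in question uses /Filter /FlateDecode, it may help to check each stream's dictionary before inflating it, so that images or streams with other filters are skipped instead of silently producing nothing. The following is only a rough sketch: pdfInflateStreams is a hypothetical helper name, and it assumes unencrypted streams, a single filter per stream, and no /DecodeParms predictor.

```php
<?php
// Sketch: return the inflated contents of every FlateDecode stream.
// Assumptions: unencrypted PDF, one filter per stream, no predictor.
function pdfInflateStreams(string $content): array
{
    $decoded = array();
    $offset = 0;

    while (($kw = strpos($content, 'stream', $offset)) !== false) {
        // The stream dictionary sits between the last "obj" keyword
        // and the "stream" keyword.
        $header = substr($content, $offset, $kw - $offset);
        $objPos = strrpos($header, 'obj');
        $dict = ($objPos === false) ? $header : substr($header, $objPos);

        // Skip the line ending that follows the "stream" keyword.
        $start = $kw + strlen('stream');
        if (substr($content, $start, 2) === "\r\n") {
            $start += 2;
        } elseif (substr($content, $start, 1) === "\n") {
            $start += 1;
        }

        $end = strpos($content, 'endstream', $start);
        if ($end === false) {
            break;
        }
        $offset = $end + strlen('endstream');

        if (strpos($dict, '/FlateDecode') === false) {
            continue; // not zlib-compressed, leave it alone
        }

        $raw = substr($content, $start, $end - $start);
        $data = @gzuncompress($raw);
        if ($data === false) {
            // Retry without the line ending that precedes "endstream".
            if (substr($raw, -2) === "\r\n") {
                $data = @gzuncompress(substr($raw, 0, -2));
            } elseif (substr($raw, -1) === "\n" || substr($raw, -1) === "\r") {
                $data = @gzuncompress(substr($raw, 0, -1));
            }
        }
        if (is_string($data)) {
            $decoded[] = $data;
        }
    }

    return $decoded;
}
```

Each returned string could then be passed to pdfExtractText(). Streams that still yield no text are likely to use fonts with custom encodings, which a simple regular expression cannot decode.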

1 Answer:

Answer 0: (score: 1)

Check whether "pdftotext" is installed on the server:

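// pdftotext typically writes its usage text to stderr, so redirect it
// to stdout for shell_exec() to capture it.
echo shell_exec('pdftotext --help 2>&1');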

If it is, use it to convert the PDF to text easily.

If not, try downloading the source code to see how they do it (pdftotext is open source).
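
If pdftotext turns out to be available, the conversion could look roughly like the sketch below. pdfToTextViaCli is a hypothetical helper name, and the sketch assumes that shell_exec() is enabled and that the server runs a POSIX shell.

```php
<?php
// Sketch: convert a PDF to text with the pdftotext command-line tool.
// Assumptions: shell_exec() is enabled and a POSIX shell is available.
function pdfToTextViaCli(string $pdfPath): ?string
{
    if (!is_file($pdfPath) || !is_readable($pdfPath)) {
        return null;
    }

    // "-" as the output file makes pdftotext write the text to stdout;
    // escapeshellarg() protects against spaces and shell metacharacters.
    $cmd = 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' - 2>/dev/null';
    $out = shell_exec($cmd);

    // shell_exec() gives no string when the command produced no output,
    // for example when pdftotext is missing or the file is not a PDF.
    return (is_string($out) && $out !== '') ? $out : null;
}
```

A call such as pdfToTextViaCli('/path/to/file.pdf') returns the extracted text, or null when nothing could be extracted.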