Question

I am trying to get the first 1,000 characters from an uploaded text file. This is what I am doing:

if($file->simpletype=="document"){
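    // Get the first 1000 bytes here. Offset 0 reads from the start of the file;
    // on PHP 7.1+ a negative offset counts from the end instead.
    $snippet = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);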
    file_put_contents('/var/www/my_logs/log.log', $snippet);
    $file->snippet = $snippet;
}

This works for .txt files: I can open log.log in gedit and read it. But for .doc, .docx, .odt and .pdf files, file_get_contents() returns garbage such as: PK\00\00\00\

I also tried another solution that I found on Stack Overflow:

function file_get_contents_utf8() {
    // Offset 0 reads from the start; a negative offset counts from the end on PHP 7.1+.
    $content = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    return mb_convert_encoding($content, 'UTF-8',
             mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

But I get the same result. Any ideas? Thanks!

Answer 1

You are trying to read text from files that are not stored in a plain-text format.
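As a rough illustration of why the output starts with PK: .docx and .odt files are ZIP archives, and every ZIP archive begins with the bytes "PK". A minimal sketch, assuming you only want to tell the container formats apart by their leading bytes, might look like this. The function name sniff_document_format() is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper: guess the container format from the first bytes of a file.
function sniff_document_format(string $head): string
{
    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'zip';     // .docx and .odt are ZIP archives of XML parts
    }
    if (strncmp($head, "%PDF-", 5) === 0) {
        return 'pdf';     // PDF files start with a "%PDF-" header
    }
    if (strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole';     // legacy binary .doc (OLE compound file)
    }
    return 'unknown';     // no known signature; may well be plain text
}
```

You could feed it the first few bytes of the upload, for example file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 8), and only treat the file as readable text when no binary signature matches.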

To read doc / docx files, you need to use a tool such as PHPDocx or http://phpword.codeplex.com.

To parse PDFs, see the answers to this question.

Answer 2

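This will definitely not work for files that are not plain text. You first need to extract the plain text from the doc / pdf / odt document, and only then can you work with that text. Just open any of these documents in a simple text editor such as Notepad and look at its contents.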
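For the ZIP-based formats, the "extract the text first" step can be sketched without a third-party library. This assumes PHP's zip and mbstring extensions are available, and relies on the main body being stored in word/document.xml for .docx and in content.xml for .odt. The helper name docx_odt_snippet() is invented for this example. It simply strips the XML tags, so it ignores headers, footnotes and formatting, and it does not handle legacy .doc or .pdf files at all.

```php
<?php
// Hypothetical helper: return the first $limit characters of text from a
// .docx or .odt file, or null if the file is not a usable ZIP-based document.
function docx_odt_snippet(string $path, int $limit = 1000): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // open() returns an error code when this is not a ZIP archive
    }

    // .docx keeps its body in word/document.xml, .odt in content.xml.
    $xml = false;
    foreach (['word/document.xml', 'content.xml'] as $entry) {
        $xml = $zip->getFromName($entry);
        if ($xml !== false) {
            break;
        }
    }
    $zip->close();
    if ($xml === false) {
        return null;
    }

    // Turn paragraph ends into spaces so adjacent paragraphs do not run together.
    $xml = str_replace(['</w:p>', '</text:p>'], ' ', $xml);

    // Drop the markup, decode XML entities, and collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // mb_substr counts characters, not bytes, so multibyte text is not cut mid-character.
    return mb_substr($text, 0, $limit, 'UTF-8');
}
```

In the code from the question, this would be called as docx_odt_snippet($_FILES['upload']['tmp_name']), with a fallback to the plain file_get_contents() approach when it returns null and the upload is a .txt file.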

For Word documents, you can start with http://phpword.codeplex.com/. Also look for other libraries that can extract content from these file types.

file_get_contents() returns invalid characters for uploaded Word documents
