Question

我正在使用pdfgrep搜索PDF文档中关键字的所有外观。

现在，我想通过PHP进行此操作，以便可以在我的网站中使用它。

但是，当我跑步时：

$output = shell_exec("pdfgrep -i $keyword $file");
$var_dump($output);

在$keyword是关键字而$file是文件的情况下，我没有得到全部输出。

PDF由产品代码，产品名称和产品价格的表格组成。

当我通过终端执行命令时，我可以看到整行数据：

product code 1    product name with keyword substring    corresponding price
product code 2    product name with keyword substring    corresponding price
product code 3    product name with keyword substring    corresponding price

但是，当我通过PHP运行它时，我得到了类似的东西：

name with keyword substring with keyword substring product code 1 
product name with keyword substring product name with keyword substring 
corresponding price

它只是无法获取所有数据。它并不总是获得产品代码和价格，并且在很多情况下，它也没有获得完整的产品名称。

我通过浏览器查看输出，并放入header('Content-Type: text/plain');，但它仅用于美化输出，数据仍不完整。

我试图通过Python3.6运行完全相同的shell脚本，这给了我所需的输出。

现在，我尝试通过PHP运行相同的Python脚本，但仍会得到相同的损坏输出。

我尝试运行一个我知道会返回较短输出的关键字，但是我仍然没有得到所需的整个数据行。

有什么方法可以可靠地获取shell_exec()命令抛出的所有数据吗？

是否存在其他选择，例如不同的命令或从服务器运行Python脚本（因为Python脚本始终没有任何问题）。

Answer 1

我不知道pdfgrep的工作原理，但也许它混入了stdout和stderr？无论哪种方式，都可以使用类似的构造，在该构造中，将输出流捕获到输出缓冲区中，还可以选择将stderr混合到stdout中：

ValueError: cannot reshape array of size 6 into shape (6,3)

Answer 2

执行过程和收集输出的方法有很多种。如果您可以一贯地重复该问题，则可以尝试其他流程执行方法：

1）exec（$ command，＆$ output）

$output = [];
exec($command, $output);

这应该将所有输出逐行推送到$ output数组中，该输出必须在调用此方法之前实例化。

2）passthru（$ command）

这会将命令的所有输出传递回输出缓冲区。所以要使用它，您需要使用输出缓冲区：

ob_start();
passthru($command);
$contents = ob_get_contents();
ob_end_clean();

3）popen（$ command，“ r”）;

$output = "";
$handle = open($command, "r");
while (!feof($handle)){
    $output .= fread($handle, 4096);
}

让我知道通过调用每个方法会得到什么。

此外，请确保检查stderror是否有错误。

Answer 3

在数字化合同的PDF文件库中，我遇到了完全相同的问题。

函数“ exec”，“ shell_exec”和“ passthru”的输出确实是随机运行的，以至于我不得不选择一种创造性的解决方案：使用ssh2_connect和ssh2_exec连接到自身计算机。

省略SSH连接部分（https://www.php.net/manual/es/function.ssh2-connect.php），代码为：

// command execution
$stream     = ssh2_exec($connection, "pdfgrep -i {$keyword} {$file}");
// set stream block: queue executions to avoid overlapping
stream_set_blocking($stream, true);
// catch stream block
$stream_out = ssh2_fetch_stream($stream, SSH2_STREAM_STDIO);
// output
return stream_get_contents($stream_out);

乍一看可能有点复杂，但是从长远来看还有更多好处：

每次执行都限于用户权限，而不是 www-data的权限
如果要使用PDF内容创建“提取缓存”，则新文件将由正确的用户创建
如果PDF保管库增长如此之快以至于必须将其外部化到CDN，使用这种方法，足以连接到它所在的主机要搜索。
最重要的是，您将获得完整的stream_out

再见！

PHP-运行shell_exec（）不会返回所有输出

3 个答案: