Question

I am trying to parse an HTTP response document to extract certain parts of it, but I cannot get the result I want. Here is what I have:

<?php

// a sample of HTTP document that I am trying to parse
$http_response = <<<'EOT'
<dl><dt>Server Version: Apache</dt>
<dt>Server Built: Apr  4 2010 17:19:54
</dt></dl><hr /><dl>
<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST</dt>
</dl>
I do not need anything below this, including this line itself
......
EOT;

echo $http_response;
echo '********************';
$count = -1;
$a = preg_replace("/(Server Version)([\s\S]*?)(MST)/", "$1$2$3", $http_response, -1, $count);
echo "<br> count: $count" . '<br>';
echo $a;

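I still see the string "I do not need ..." in the output. I do not want that string. What am I doing wrong?
Also, how can I easily remove all the remaining HTML tags?

Thanks for your help.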

-Amit

Answer 1

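Your pattern matches everything from Server Version to MST. preg_replace only modifies the part of the string that was matched; everything the regex does not cover is left untouched.

So to get rid of the text before the first anchor and the text after the last one, you have to match those parts as well.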

= preg_replace("/^.*(Server Version)(.*?)(MST).*$/s", "$1$2$3",

Note the ^.* and the .*$. Both will match, but neither is referenced in the replacement string, so the text they match is dropped.
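
For reference, here is the truncated call above written out as a complete script. This is only a sketch: it assumes the variable names $http_response, $a and $count from the question and reuses the question's sample text.

```php
<?php
// Sample input copied from the question; the last two lines are unwanted.
$http_response = <<<'EOT'
<dl><dt>Server Version: Apache</dt>
<dt>Server Built: Apr  4 2010 17:19:54
</dt></dl><hr /><dl>
<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST</dt>
</dl>
I do not need anything below this, including this line itself
......
EOT;

// ^.* and .*$ consume the text outside the two anchors.
// The s modifier lets . match newlines, so the match covers the whole string.
// Only groups 1-3 appear in the replacement, so everything else is dropped.
$a = preg_replace('/^.*(Server Version)(.*?)(MST).*$/s', '$1$2$3', $http_response, -1, $count);

echo "count: $count\n";
echo $a, "\n";
```

Single quotes are used here so that neither the backslashes nor $1$2$3 can be touched by PHP string interpolation.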

Of course, in this case it would probably be simpler to just use preg_match() ...
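
A minimal sketch of the preg_match() route, assuming a shortened version of the question's sample and a hypothetical variable name $extracted. Instead of replacing what you do not want, it simply captures what you do want:

```php
<?php
// Shortened stand-in for the document in the question.
$http_response = "<dl><dt>Server Version: Apache</dt>\n"
    . "<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST</dt>\n"
    . "</dl>\nI do not need this line\n";

// $m[0] holds the whole match, from "Server Version" up to the first "MST".
if (preg_match('/Server Version.*?MST/s', $http_response, $m)) {
    $extracted = $m[0];
} else {
    $extracted = null; // anchors not found
}

echo $extracted, "\n";
```

No leading or trailing wildcard is needed, because preg_match() only returns the matched part.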

Answer 2

You also need to match the characters before and after the part you want to keep, for example:

/.*?(Server Version)([\s\S]*?)(MST).*/s

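The s flag (PCRE_DOTALL) makes . match newline characters too. You need it here because the text before and after the part you keep spans several lines.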
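
To see what the flag changes, here is a small sketch on a made-up two-line string (the variable names are only for illustration):

```php
<?php
$text = "Server Version\nMST";

// Without s, . stops at the newline, so the pattern cannot span both lines.
$without = preg_match('/Server Version.*MST/', $text);         // 0

// With s (PCRE_DOTALL), . also matches "\n".
$with = preg_match('/Server Version.*MST/s', $text);           // 1

// [\s\S] matches any character, newline included, with or without the flag.
$charClass = preg_match('/Server Version[\s\S]*MST/', $text);  // 1

var_dump($without, $with, $charClass);
```

So in the pattern above, the middle group would work without the flag, but the . wildcards at both ends would not.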

To remove the HTML tags, use strip_tags.
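
A quick sketch of that last step, assuming $fragment already holds the extracted text (the value here is a shortened, made-up stand-in):

```php
<?php
// Stand-in for the text extracted between "Server Version" and "MST".
$fragment = "Server Version: Apache</dt>\n<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST";

// strip_tags() removes the tags and keeps the text between them.
$plain = strip_tags($fragment);

echo $plain, "\n";
```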

PHP multi-line preg_replace to extract part of an HTML document
