Question

I have the following string in my HTML.

BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);

I want to extract src and publisher_id from this string.

To do that, I tried the following code:

$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';

preg_match($regex, $html, $matches);

$match = $matches[1];

But it always returns null.

Does my regex only select src?

If I need to parse the whole string between BookSelector.load( and );, what should my regex be?

Answer 1

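Why doesn't your regex work?

First, let me explain why your regex doesn't work:

Your regex starts with \B. That is an assertion matching any position that is not a word boundary (\b), so the literal B is no longer part of the pattern and it looks for "ookSelector" at a non-boundary position. This is not what you intended.
Your input text contains escaped quotes (\"), but your regex does not account for them. In the pattern, \" matches a plain double quote, while the text has a backslash between src= and the quote, so the match fails.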
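
A minimal sketch of the escaped-quote problem, assuming the page source contains the backslash-escaped quotes exactly as shown in the question. The variable names and the adjusted pattern are illustrative only:

```php
<?php
// Sample input copied from the question. A nowdoc keeps every backslash literal.
$html = <<<'TXT'
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
TXT;

// The pattern from the question: src=\" only matches src=" (no backslash),
// which never occurs in the input, so preg_match() reports no match.
$original = preg_match('#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s', $html, $m1);

// Adjusted pattern: the stray backslash before B is gone, and \\\\ in a
// single-quoted PHP string becomes \\ in the regex, i.e. one literal backslash.
$adjusted = preg_match('#BookSelector\.load\(.*?src=\\\\"(.*?)\\\\"#s', $html, $m2);

var_dump($original, $adjusted, $m2[1] ?? null);
```

Even with the adjusted pattern, the captured value still carries the JSON escaping (\/ instead of /), which is one reason to decode the JSON rather than pick values out of the raw text.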

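The right way to solve this

Split the task into several parts and solve each one with the best tool available.

The data you need is wrapped in a JSON structure, so the first step is to extract the JSON content. A regex is fine for this.
Once you have the JSON content, you need to decode it to get at the data inside. PHP has a built-in function for this: json_decode(). Call it with the JSON string and set the second argument to true, and you get a convenient associative array.
Once you have the associative array, you can easily read the payload entry, which holds the <script> tag as a string.
If you are sure the attributes always appear in the same order, you can use a regex to extract the values you need. If not, it is better to use an HTML parser (for example PHP's DOMDocument).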
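
To illustrate the decoding step on its own, here is a small sketch. It assumes the JSON argument has already been captured from BookSelector.load(...), and the error check is an addition for illustration:

```php
<?php
// JSON argument as it appears in the question. A nowdoc keeps backslashes literal.
$json = <<<'JSON'
[{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]
JSON;

// Passing true as the second argument returns associative arrays instead of objects.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

// json_decode() has already removed the \" and \/ escapes from the payload.
$payload = $data[0]['payload'];
echo $payload, PHP_EOL;
```

After decoding, the payload is an ordinary HTML fragment with plain quotes and slashes, so the later steps no longer have to deal with the JSON escaping.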

The complete code:

// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {

    // Get the JSON string and decode it using json_decode()
    $json    = $matches[1];
    $content = json_decode($json, true)[0]['payload'];

    // Load the payload (an HTML fragment) into DOMDocument
    $dom = new DOMDocument;
    $dom->loadHTML($content);

    // Find the <script> tag and read the required attributes
    $script_tag   = $dom->getElementsByTagName('script')->item(0);
    $script_src   = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');

    var_dump($script_src, $publisher_id);
}

Output:

string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"
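
If the attribute order is guaranteed, the last step could also be done with a regex instead of DOMDocument. The following is only a sketch under that assumption (src before publisher_id, both values double-quoted), and the variable names are illustrative:

```php
<?php
// The payload as it looks after json_decode(), with the escapes already removed.
$payload = '<script type="text/javascript" charset="utf-8" '
         . 'src="//www.192.168.10.85/libs/js/books.min.js" publisher_id="890"></script>';

$src = $publisherId = null;

// Assumes src always comes before publisher_id; a different order will not match.
if (preg_match('/\bsrc="([^"]*)".*?\bpublisher_id="([^"]*)"/s', $payload, $m)) {
    [, $src, $publisherId] = $m;
}

var_dump($src, $publisherId);
```

This is shorter, but it silently returns nothing as soon as the markup changes, which is why the DOMDocument version is the safer default.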
