Question

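I'm still trying to get to grips with regex and hope someone can help with a simple query. I'm trying to parse my website's home page and extract the H1 tag.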

  <?php
    $string_get = file_get_contents("http://davidelks.com/");


    $replace = "$1";

    $matches = preg_replace ("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/", $replace, $string_get, 1);

    $string_construct = "Mum " . $matches .  " Dad";

    echo ($string_construct);

    ?>

However, instead of displaying only the first HTML link via the $1 token, it pulls in the entire page.

Can anyone help?

Answer 1

This looks like something a DOM parser can do easily:

libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$dom = new DOMDocument;
$dom->loadHTMLFile('http://davidelks.com/'); // parse as HTML; load() expects well-formed XML
$h1 = $dom->getElementsByTagName('h1')->item(0);
echo $h1->textContent;

You should get:

Let's make things happen in and around Stoke-on-Trent

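Note: I'm not sure whether this is your own site or one you manage, but an HTML page shouldn't contain more than one <h1> tag (there are a couple on the home page).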
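Since the page apparently has more than one <h1>, it may help to list them all rather than only item(0). The following is a minimal sketch, not the answerer's code: the $html string is made-up sample markup standing in for the fetched page, and $headings is a name chosen for illustration.

```php
<?php
// Hypothetical sample markup standing in for the real home page.
$html = '<html><body>'
      . '<h1 class="title"><a href="/first">First heading</a></h1>'
      . '<h1 class="title"><a href="/second">Second heading</a></h1>'
      . '</body></html>';

libxml_use_internal_errors(true); // suppress warnings from sloppy real-world HTML
$dom = new DOMDocument;
$dom->loadHTML($html);            // parse from a string instead of a URL

// Collect the text of every <h1>, not just the first one.
$headings = [];
foreach ($dom->getElementsByTagName('h1') as $h1) {
    $headings[] = trim($h1->textContent);
}

echo implode("\n", $headings), "\n";
```

With a live page, swapping loadHTML($html) for loadHTMLFile() on the URL should give the same loop something real to work on.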

Answer 2

The mistake is in using preg_replace. You want to extract content, which is what preg_match is for:

<?php
 $text = file_get_contents("http://davidelks.com/");

 preg_match('#<h1 class="title"><a href="([\w\s\x21\/\-\.\£\:]*)">([^<>]*)</a></h1>#', $text, $match);

 echo "Mum " . $match[1] .  " Dad";
?>
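To see why the original attempt printed the whole page, here is a small sketch comparing the two functions on the same input. The $page string is invented sample data, and the href class is simplified to [^"]* for readability; both are assumptions, not part of the answer above.

```php
<?php
// Hypothetical sample input; the real text would come from file_get_contents().
$page    = 'before <h1 class="title"><a href="/about">About us</a></h1> after';
$pattern = '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#';

// preg_replace returns the WHOLE subject, with only the matched part swapped out.
$replaced = preg_replace($pattern, '$1', $page, 1);

// preg_match leaves the subject alone and fills $match with the captured groups.
$found = preg_match($pattern, $page, $match);

echo $replaced, "\n";         // the text around the match is still there
if ($found === 1) {
    echo $match[1], "\n";     // first group: the href
    echo $match[2], "\n";     // second group: the link text
}
```

Checking the return value of preg_match before reading $match avoids undefined-index notices when the markup changes and the pattern no longer matches.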

In particular, note that you can combine character classes. You don't need [A-Z]|[a-z]|[..], because you can merge them into a single [A-Za-z...] bracket list.

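Also try using single quotes for the PHP string when the pattern contains double quotes. That saves a lot of extraneous escaping. The same goes for using # as the delimiter around the regex instead of /.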
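As a rough illustration of both tips, the sketch below checks that an alternation of single-character classes and one merged class accept the same strings. The patterns are trimmed-down versions of the ones in the question (the \s, \x21 and £ entries are left out), and the sample strings are made up.

```php
<?php
// Long form: alternation of classes, "/" delimiters, double-quoted string.
$verbose = "/^([A-Z]|[0-9]|[a-z]|[\/]|[\-]|[\.]|[\:])*$/";

// Merged form: one class, "#" delimiters, single-quoted string, no escaping needed.
$compact = '#^[A-Za-z0-9/.:-]*$#';

// Hypothetical sample inputs; the last two contain characters outside the class.
$samples = ['http://davidelks.com/', 'about-us.html', 'has space', 'a?b'];

$allSame = true;
foreach ($samples as $s) {
    $same = preg_match($verbose, $s) === preg_match($compact, $s);
    $allSame = $allSame && $same;
    printf("%-25s %s\n", $s, $same ? 'same result' : 'different result');
}
```

Placing the hyphen last in the merged class keeps it literal, so it does not need a backslash.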

Answer 3

It would be easier with a DOM parser. But if you want to do it with a regular expression, you should use PHP's preg_match_all function:

preg_match_all("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/",$string_get,$matches);
var_dump($matches);
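The var_dump output can be hard to read at first, so here is a sketch of how $matches is laid out with the default ordering. The $page string and the simplified pattern are assumptions made for the example, not the pattern above.

```php
<?php
// Hypothetical sample input with two matching headings.
$page = '<h1 class="title"><a href="/one">One</a></h1>'
      . '<h1 class="title"><a href="/two">Two</a></h1>';
$pattern = '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#';

$count = preg_match_all($pattern, $page, $matches);

// Default ordering (PREG_PATTERN_ORDER): one sub-array per capture group.
// $matches[0] = full matches, $matches[1] = hrefs, $matches[2] = link texts.
for ($i = 0; $i < $count; $i++) {
    echo $matches[2][$i], ' -> ', $matches[1][$i], "\n";
}
```

Passing PREG_SET_ORDER as a fourth argument groups the results per match instead, which some find easier to loop over.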

Regex query - can anyone help?
