Question

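I want to find page titles in a huge haystack, but there are no classes or unique IDs, so I can't use a DOM parser here. I know I have to use a regex. Here is an example of what I want to find: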

<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>

The output should be an array of:

[0] => Series Hell In Heaven information
[1] => Series What is going information

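All series titles start with Series and end with information. From a large string containing many other things, I only want to extract the titles. I am currently trying a regex, but it is not working. Here is what I am doing right now: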

$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
    print_r($matches);
echo "</pre>";

I know very little about writing regular expressions. Any help would be appreciated. Thanks.

Answer 1

Try this:

$str = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>';
preg_match_all('/Series(.*?)information/', $str, $matches);
echo "<pre>";
    print_r($matches);
echo "</pre>";

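The captures will be in $matches[1]. Basically, your regex does not match because of the \., which requires a literal dot right after Series.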

[EDIT]

If you also need the words Series and information, there is no need to capture: use /Series.*?information/ and find the matches in $matches[0].
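To make the difference concrete, here is a minimal sketch that runs the question's original pattern and the corrected one against the sample markup. The variable names ($original, $fixed, $titles, $middles) are only illustrative and do not come from the answer.

```php
<?php
// Sample input copied from the question.
$html = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>';

// Original attempt: ^ and $ anchor the pattern to the start and end of the
// whole string, and \. demands a literal dot right after "Series".
$original = preg_match_all('/^Series\..*information$/', $html, $m1);

// Anchors and the literal dot removed, lazy quantifier used instead.
$fixed = preg_match_all('/Series(.*?)information/', $html, $m2);

// $m2[0] holds the full matches, $m2[1] only the text between the two words.
$titles  = $m2[0];
$middles = array_map('trim', $m2[1]);

var_dump($original, $fixed);
print_r($titles);
print_r($middles);
```

Here the original pattern finds nothing, while the corrected one returns both titles in $m2[0] and the inner text (with surrounding whitespace, hence the trim) in $m2[1].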

Answer 2

Try

 preg_match_all('/(Series.+?information)/', $str, $matches );

as shown here:

https://regex101.com/r/oJ0jZ4/1

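As I said in the comments, remove the literal dot \. as well as the start and end anchors (^ and $). I would also use the non-greedy "any character" quantifier .+? rather than .*?

Otherwise you could match this: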

Seriesinformation
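A quick check of that point, as a sketch: .*? allows zero characters between the two words, while .+? requires at least one. The variable names are illustrative only.

```php
<?php
$sample = 'Seriesinformation';

// .*? may match zero characters, so the two words can touch.
$star = preg_match('/Series.*?information/', $sample);

// .+? needs at least one character between the two words.
$plus = preg_match('/Series.+?information/', $sample);

var_dump($star); // int(1)
var_dump($plus); // int(0)
```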

If the casing of Series or information might vary, for example

series .... Information

add the /i flag, as in

     preg_match_all('/(Series.+?information)/i', $str, $matches );

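The outer capture group is not really needed, but I think it looks better with it there. If you only want the variable content, without Series or information, move the capture ( ) to just that part: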

 preg_match_all('/Series(.+?)information/i', $str, $matches );

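Note that you will want to trim() the matches, since they may have whitespace at the beginning and end, or build the whitespace into the regex like this: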

 preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );

But this will fail to match Series information when the two words are separated by only a single space.
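The trade-off between the two approaches can be seen in a small sketch. The two-line test string is made up for illustration and is not from the question.

```php
<?php
// Hypothetical input: one normal title and one with nothing between the words.
$str = "Series Hell In Heaven information\nSeries information";

// Option 1: capture loosely, then trim() each capture afterwards.
preg_match_all('/Series(.+?)information/i', $str, $loose);
$trimmed = array_map('trim', $loose[1]);

// Option 2: require whitespace on both sides inside the pattern.
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $strict);

print_r($trimmed);   // two entries, the second is an empty string
print_r($strict[1]); // one entry, "Series information" is skipped
```

With trim() the bare "Series information" still matches and yields an empty capture. With \s on both sides it is not matched at all, because the pattern needs a whitespace character, at least one other character, and another whitespace character between the two words.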

If you want to be sure you do not match something like

[Series Hell In Heaven information Series Hell In Heaven information]

as a single match covering all of it, you can use a positive lookbehind:

preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );

Conversely, if the title might contain the word information twice

   <a href="http://example.com/123">
        Series information is power information
   </a>

you can do this:

    preg_match_all('/(Series[^<]+)</i', $str, $matches );

which matches everything up to the < of </a.
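As a sketch of why this matters, the following compares the lazy pattern with the [^<]+ pattern on the two-"information" example. Variable names are illustrative, and the trim() is added here because the capture includes the newline before </a>.

```php
<?php
$html = '<a href="http://example.com/123">
    Series information is power information
</a>';

// The lazy pattern stops at the first "information".
preg_match_all('/Series.+?information/i', $html, $lazy);

// [^<]+ consumes everything up to the next "<", so the whole title is kept.
preg_match_all('/(Series[^<]+)</i', $html, $upTo);
$title = trim($upTo[1][0]);

print_r($lazy[0]);
echo $title, "\n";
```

This approach assumes the title text itself never contains a < character.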

As a side note, you could use the phpQuery library (a DOM parser) and look for a tags that contain those words.

https://github.com/punkave/phpQuery

and

https://code.google.com/archive/p/phpquery/wikis/Manual.wiki

using something like

  // Assumes $doc was created with phpQuery::newDocument($html).
  $tags = $doc->find("a:contains('Series')")->text();

It is an excellent library for parsing HTML.
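If installing phpQuery is not an option, the same idea can be sketched with PHP's built-in DOM extension. Note that this swaps phpQuery for DOMDocument and DOMXPath, which is not what the answer proposes, and it assumes the DOM extension is enabled. The XPath expression and variable names are illustrative.

```php
<?php
// Sample input copied from the question.
$html = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely well formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> whose text contains "Series" (case-sensitive).
$xpath  = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//a[contains(., "Series")]') as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```

Unlike the regex versions, this only inspects a elements, so the word Series appearing elsewhere in the page is ignored.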

PHP regex to find substrings in a large string - matching start and end
