Question

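I'm trying to grab some data from a website's source code. What I want is to capture everything that follows /collections/ (whatever comes after it). My pattern matches "most" of what I'm looking for. The problem occurs when preg_match_all runs into "&amp;" in the text: it reads only up to the "&amp;" and stops reading the rest. Here is my script: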

$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = '/collections([\w-&\/]*)/i';
preg_match_all($pattern, $processedHomePage, $collections);
print_r($collections);

Note that when printed like this, everything after the "&amp;" is ignored, which means I get this:

/collections/seafood/Shellfish-&

But when I run the pattern against a plain string like this:

 $subject = 'a href="/collections/organic/Pantry/sickmonster/grandma"  <a href="/collections/seafood/Shellfish-&-Crustaceans">Oysters, Shellfish & Crustaceans';

it gives me everything I want:

/collections/seafood/Shellfish-&-Crustaceans

So I'm wondering... why does this happen? I'm really stumped.

Answer 1

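The code you posted works fine when $homepage is passed to preg_match_all instead of $processedHomePage.

By the way: you should escape the minus sign inside the square brackets (or put it at the very beginning or end of the bracket expression), although surprisingly it makes no difference in your case:

$pattern = '/collections([-\w&\/]*)/i';

For details, see http://php.net/manual/regexp.reference.meta.php.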
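To illustrate the point about the hyphen, here is a minimal sketch comparing three unambiguous spellings of the same character class. The $subject string is only a stand-in modelled on the link from the question, not the live page. As far as I'm aware, newer PHP versions (7.3 and later, which are built on PCRE2) may reject a hyphen placed between \w and another character with an "invalid range in character class" error, so these safer forms are worth using either way.

```php
<?php
// Sample input modelled on the link from the question (an assumption, not the live page).
$subject = '<a href="/collections/seafood/Shellfish-&-Crustaceans">';

// Three unambiguous spellings of the same character class.
$patterns = [
    'hyphen first'   => '/collections([-\w&\/]*)/i',
    'hyphen last'    => '/collections([\w&\/-]*)/i',
    'hyphen escaped' => '/collections([\w\-&\/]*)/i',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Keep the full match only when the pattern compiled and matched.
    if (preg_match($pattern, $subject, $m) === 1) {
        $results[$label] = $m[0];
    }
}

print_r($results);
```

All three variants should return the same match, since the class contains the same characters in each case.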

Answer 2

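I figured out what the problem was; maybe this will help someone else later.

I had run the contents of http://www.harrisfarm.com.au/ through htmlspecialchars() before reading them as a string. That converts some special characters (such as & and a few others) into multi-character sequences.

In particular, & is converted to &amp;, which contains a ;, and ; is not in my regex's character class. Since ; is not part of the class, the regex stops matching at that point.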
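The effect can be reproduced without fetching the page. This is a small sketch in which $html is a made-up stand-in for the downloaded source, and the pattern is the one from the question with the hyphen moved to the front of the class:

```php
<?php
// Stand-in for the downloaded page (assumption: the real page contains a link like this).
$html = '<a href="/collections/seafood/Shellfish-&-Crustaceans">Oysters</a>';

// The pattern from the question, with the hyphen moved to the front of the class.
$pattern = '/\/collections[-\w&\/]*/i';

// Matching the raw HTML returns the full path.
preg_match($pattern, $html, $raw);

// After htmlspecialchars(), "&" has become "&amp;", and ";" is not in the class,
// so the match ends right before the semicolon.
$escaped = htmlspecialchars($html);
preg_match($pattern, $escaped, $cut);

echo $raw[0], "\n";  // /collections/seafood/Shellfish-&-Crustaceans
echo $cut[0], "\n";  // /collections/seafood/Shellfish-&amp
```

Viewed in a browser, the truncated "&amp" is rendered as "&", which is probably why the output looked as if it simply stopped at the ampersand.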

Answer 3

Try this:

$re = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
$str = "<a href=\"/collections/seafood/Shellfish-&amp;-Crustaceans\">Oysters, Shellfish & Crustaceans';\n<a href=\"/collections/seafood/Shellfish-&-Crustaceans\">Oysters,collections Shellfish & Crustaceans';";

preg_match_all($re, $str, $matches);


Your updated code:

$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
preg_match_all($pattern, $homepage, $collections);
print_r($collections);
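Because this pattern also accepts ";", a path written as "&amp;" in the HTML is captured with the entity still in it. If the plain "&" form is wanted, one possible follow-up step (a sketch, with $html as a hypothetical stand-in for the fetched page) is to decode each match afterwards:

```php
<?php
// Hypothetical snippet standing in for the fetched page.
$html = '<a href="/collections/seafood/Shellfish-&amp;-Crustaceans">Oysters</a>'
      . '<a href="/collections/organic/Pantry">Pantry</a>';

// Same character class as above, written in a single-quoted string.
$pattern = '/\/collections[\w\-&\/;]*/i';
preg_match_all($pattern, $html, $matches);

// Turn "&amp;" back into "&" in every matched path.
$paths = array_map('htmlspecialchars_decode', $matches[0]);

print_r($paths);
```

Decoding after matching keeps the regex simple and leaves the original page text untouched.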

PHP preg_match_all not matching correctly
