Question

I am trying to find the next page link of a particular page (which I will call the current page here). The current page in my program is

http://en.wikipedia.org/wiki/Category:1980_births

The next page link that I extract from the current page is the following:

http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

However, when file_get_contents() loads the next page link, it returns the contents of the current page instead.

The code is:

<?php

// Fetch the contents of the current page.
$string = file_get_contents("http://en.wikipedia.org/wiki/Category:1980_births");

// Extract the next page link from the current page's contents.
preg_match_all("/\(previous page\) \(<a href=\"(.*)\" title/", $string, $matches);

// Take the first match.
foreach ($matches[1] as $match) {
    break;
}

$next_page_link = $match;

// The extracted link is only a path without the domain name, so prepend the domain.
// This has no bearing on the problem itself.
$next_page_link = "http://en.wikipedia.org" . $next_page_link;

$string1 = file_get_contents($next_page_link);
echo $next_page_link;
echo $string1;

?>

According to the code, $string1 should hold the contents of the next page link, but it only gets the contents of the current page.

Answer 1

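In the page source of the original site, the link has entity-encoded ampersands (see "Do I encode ampersands in <a href…>?"). When you click the anchor, the browser decodes them as usual, but your scraping code does not. Compare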

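http://en.wikipedia.org/ ... &amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

with

http://en.wikipedia.org ... &pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

The first, malformed query string is what you are actually passing to file_get_contents. You can convert the entities back to regular ampersands, for example with html_entity_decode(), before making the request.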
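A minimal sketch of that conversion, assuming the extracted href looks like the path from the question with `&amp;` in place of `&` (the exact href string here is reconstructed for illustration, and `$raw_href` is a hypothetical variable name). It uses PHP's built-in html_entity_decode() and makes no network request, so it only shows the URL that would then be passed to file_get_contents:

```php
<?php

// Assumed form of the href as it appears in the page source:
// the ampersand separating the query parameters is entity-encoded.
$raw_href = '/w/index.php?title=Category:1980_births&amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages';

// Decode HTML entities so that "&amp;" becomes a plain "&" again,
// then prepend the domain as in the question's code.
$next_page_link = 'http://en.wikipedia.org' . html_entity_decode($raw_href);

echo $next_page_link, "\n";

// The decoded link is what should be fetched:
// $string1 = file_get_contents($next_page_link);
```

The percent-encoded parts of the URL (such as %2C and %0A) are left untouched, since html_entity_decode() only handles HTML entities.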
