Having issues with preg_replace() in a foreach loop

Asked: 2016-02-06 20:39:22

Tags: php foreach preg-replace domdocument

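Long story short, my client can't access their server because of a dispute, and they need all of their club photos so I can build a new site. I have to download them by URL; they are served through a PHP output handler that provides different sizes to reduce server load.

There are over 3000 of them, and I'm not going to waste my time doing this one by one.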

So I decided to write a quick and [very] dirty PHP script that uses DOMDocument to scrape the pages and look for links to images, across each album and then across the album sub-pages.

Everything works except one particular part of the script, which looks on the album pages for:

(1) links to images,

<div class='imagethumb'>
    <a href="/gallery/index.php?album=blowout1&image=blahblah.jpg" title="Blahblah">
        <img src="/gallery/index.php?album=blowout1&image=blahblah_thumb.jpg" />
    </a>
</div>

(2) links to subsequent pages, i.e.

<li>
    <a href="/gallery/index.php?album=beginning&amp;page=2" title="Page 2">2</a>
</li>

(3) the album's "last page" link, or "..."

<li>
    <a href="/gallery/index.php?album=recognition&page=9" title="Page 9">...</a>
</li>

Here is the relevant part of the script:

//$url is an argument in the function wrapping this script
//look on albums for links
foreach ($album_links as $a_url) {
    $album_html = file_get_contents($a_url['url']);
    $album = new DOMDocument;
    $album->loadHTML($album_html);
    $i_links = $album->getElementsByTagName('a');
    $album_title = $album->getElementsByTagName('title')->item(0)->textContent;

    //to keep track of the number of sub-page links found, exclude page 1
    $num_page_lnks = 1;

    //search through all links on the page, look for:
    foreach ($i_links as $link) {
        //Links contained in div with class='imagethumb'
        if ($link->parentNode->getAttribute('class') == 'imagethumb' ) {
            array_push($image_links, ["album" => str_replace(" | ", "", $album_title), "title" => $link->getAttribute('title'), "url" => "http://" . parse_url($url, PHP_URL_HOST) . $link->getAttribute('href') . "&p=*full-image"]);
        }
        //links contained in li with no class, has a page number in the title, and is not a "..." link
        elseif ($link->parentNode->getAttribute('class') == '' && preg_match('/Page\040\d*/', $link->getAttribute('title')) && $link->textContent != "...") {
            //add to the number of sub page links found
            $num_page_lnks++;
            array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . $link->getAttribute('href'));
        }
        //links containing the text "..." (link to last album page, if more than 7 pages)
        elseif($link->textContent == "...") {
            //Parse the url into parts
            $url_parse=[];
            parse_str($link->getAttribute('href'), $url_parse);

            //Last Page links appear when greater than 7 pages, so start at 8 ($num_page_links + 1)
            for ($count = ($num_page_lnks + 1); $count < ($url_parse['page'] + 1); $count++) {
                array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . preg_replace("/[^\=]\d+$/", $count, $link->getAttribute('href')));
            }
        }
    }
    unset($album);
    unset($album_html);
    unset($i_links);
}

If the script finds a sub-page link, it adds to $num_page_lnks, so that when it finds the "..." link it knows which page to start from when creating the in-between page links.

This is what it returns:

{
    "0": "http://club.website.com/gallery/index.php?album=beginning&page=2",
    "1": "http://club.website.com/gallery/index.php?album=beginning&page=3",
    "2": "http://club.website.com/gallery/index.php?album=history&page=2",
    "3": "http://club.website.com/gallery/index.php?album=history&page=3",
    "4": "http://club.website.com/gallery/index.php?album=history&page=4",
    "5": "http://club.website.com/gallery/index.php?album=history&page=5",
    "6": "http://club.website.com/gallery/index.php?album=history&page=6",
    "7": "http://club.website.com/gallery/index.php?album=history&page=7",
    "8": "http://club.website.com/gallery/index.php?album=memorial&page=2",
    "9": "http://club.website.com/gallery/index.php?album=memorial&page=3",
    "10": "http://club.website.com/gallery/index.php?album=memorial&page=4",
    "11": "http://club.website.com/gallery/index.php?album=memorial&page=5",
    "12": "http://club.website.com/gallery/index.php?album=memorial&page=6",
    "13": "http://club.website.com/gallery/index.php?album=memorial&page=7",
    "14": "http://club.website.com/gallery/index.php?album=memorial&page=9",
    "15": "http://club.website.com/gallery/index.php?album=memorial&page=9",
    "16": "http://club.website.com/gallery/index.php?album=members&page=2",
    "17": "http://club.website.com/gallery/index.php?album=members&page=3",
    "18": "http://club.website.com/gallery/index.php?album=members&page=4",
    "19": "http://club.website.com/gallery/index.php?album=members&page=5",
    "20": "http://club.website.com/gallery/index.php?album=members&page=6",
    "21": "http://club.website.com/gallery/index.php?album=members&page=7",
    "22": "http://club.website.com/gallery/index.php?album=members&page=8",
    "23": "http://club.website.com/gallery/index.php?album=members&page=9",
    "24": "http://club.website.com/gallery/index.php?album=members&page=10",
    "25": "http://club.website.com/gallery/index.php?album=members&page=11",
    "26": "http://club.website.com/gallery/index.php?album=toy_run&page=2",
    "27": "http://club.website.com/gallery/index.php?album=toy_run&page=3",
    "28": "http://club.website.com/gallery/index.php?album=toy_run&page=4",
    "29": "http://club.website.com/gallery/index.php?album=toy_run&page=5",
    "30": "http://club.website.com/gallery/index.php?album=toy_run&page=6",
    "31": "http://club.website.com/gallery/index.php?album=toy_run&page=7",
    "32": "http://club.website.com/gallery/index.php?album=toy_run&page=8",
    "33": "http://club.website.com/gallery/index.php?album=recognition&page=2",
    "34": "http://club.website.com/gallery/index.php?album=recognition&page=3",
    "35": "http://club.website.com/gallery/index.php?album=recognition&page=4",
    "36": "http://club.website.com/gallery/index.php?album=recognition&page=5",
    "37": "http://club.website.com/gallery/index.php?album=recognition&page=6",
    "38": "http://club.website.com/gallery/index.php?album=recognition&page=7",
    "39": "http://club.website.com/gallery/index.php?album=recognition&page=9",
    "40": "http://club.website.com/gallery/index.php?album=recognition&page=9",
    "41": "http://club.website.com/gallery/index.php?album=blowout1&page=2",
    "42": "http://club.website.com/gallery/index.php?album=blowout1&page=3",
    "43": "http://club.website.com/gallery/index.php?album=blowout1&page=4",
    "44": "http://club.website.com/gallery/index.php?album=blowout1&page=5",
    "45": "http://club.website.com/gallery/index.php?album=blowout1&page=6",
    "46": "http://club.website.com/gallery/index.php?album=blowout1&page=7",
    "47": "http://club.website.com/gallery/index.php?album=blowout1&page=8",
    "48": "http://club.website.com/gallery/index.php?album=blowout1&page=9",
    "49": "http://club.website.com/gallery/index.php?album=blowout1&page=10"
}

That object does contain the right number of sub-pages, but here is the problem:

  1. When there are 7 or fewer album pages (6 sub-pages), the script works fine.
  2. When there are 8 album pages (7 sub-pages), the script works fine.
  3. When there are 9 album pages (8 sub-pages: [1] current page, [2] [3] [4] [5] [6] [7] [...] last page (9)), the script doubles up: the link to page 9 appears twice and there is no link to page 8.
  4. When there are 10 or more album pages, there is no problem.

I can't figure out what I'm doing wrong.

Edit:

Here is the source HTML for $i_links:

1 Answer:

Answer 0: (score: 1)

The problem is in the last nested loop:

//Last Page links appear when greater than 7 pages, so start at 8 ($num_page_links + 1)
for ($count = ($num_page_lnks + 1); $count < ($url_parse['page'] + 1); $count++) {
    array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . preg_replace("/[^\=]\d+$/", $count, $link->getAttribute('href')));
}

By the time you reach the 7th sub-link (the one whose text content is "..."), the $num_page_lnks variable has the value 7 and $url_parse['page'] has the value 9. So there will be two iterations, in which $count is assigned 8 and then 9.
But ... the links stay the same:

"http://club.website.com/gallery/index.php?album=recognition&page=9"
"http://club.website.com/gallery/index.php?album=recognition&page=9"

because your regex does not make the expected replacement:

var_dump(preg_replace("/[^\=]\d+$/",8,"/gallery/index.php?album=recognition&amp;page=9"));
// will output:   string(47) "/gallery/index.php?album=recognition&page=9"

Change the regex pattern, or consider other logic.
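One possible replacement pattern, offered as a sketch and not necessarily the one this answer originally suggested. The pattern `[^\=]\d+$` needs a character other than "=" in front of the trailing digits, so it matches the "10" in `page=10` but finds nothing to replace in `page=9`. Matching only the trailing digits avoids that. `page_link()` is a hypothetical helper name, and the sample href is the one from the question.

```php
<?php
// Hypothetical helper (not part of the original script): swap the trailing
// page number of an album href for $page.
function page_link(string $href, int $page): string
{
    // \d+$ matches only the digits at the very end of the string, so both
    // "page=9" and "page=10" are rewritten and the "=" is left alone.
    return preg_replace('/\d+$/', (string) $page, $href);
}

$href = '/gallery/index.php?album=recognition&page=9';

// The original pattern leaves a single-digit page number untouched.
var_dump(preg_replace('/[^\=]\d+$/', '8', $href) === $href); // bool(true)

// Build the in-between links for pages 8 and 9, as the loop in the question does.
$links = [];
for ($count = 8; $count <= 9; $count++) {
    $links[] = page_link($href, $count);
}
print_r($links);
```

This still assumes the page number is the last thing in the href. If the query parameters could appear in a different order, rebuilding the query string from its parsed parts would be a sturdier choice.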