Question

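I am trying to "steal" product names from a website so that I can list them myself. I want to store these values in an array. So far I have managed to print them out via cURL and strip away all the styling.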

Here is my code:

<?php
$ch = curl_init("http://www.nrs.com/category/3101/whitewater-kayaking/helmets");
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);

$website = file_get_contents('example_homepage.txt');
//COLLECTED AND STORED WEBSITE AS VARIABLE

preg_match_all('#\<h2>(.+?)\<\/h2>#s', $website, $unfiltered);

$products = array_pop($unfiltered);
$remove_how_much = (count($unfiltered[0]))-(array_search('Follow Us:',$products));


for($count=1;$count<=$remove_how_much;$count++) {
    array_pop($products);
}

for($counter=0;$counter<=(count($products)-1);$counter++) {
    $explode1 = explode('>',$products[$counter]);
    $explode2 = explode ('</a',$explode1[1]);
    echo $explode2[0];
    echo '<br/>';
}

?>

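Run a quick test and you will see the names printed out. I would like to save these values into an array, check for duplicates, and remove the text " - Closeout" from every value.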

I also need to check the other paginated pages, so I need to loop from

http://www.nrs.com/category/3101/whitewater-kayaking/helmets?pg=1

to

http://www.nrs.com/category/3101/whitewater-kayaking/helmets?pg=2

and so on, until I get an error or a duplicate page.

Any ideas?

Also, is there a way to improve my current code so that it scrapes the names more efficiently?

Answer 1

Use PHP Simple HTML DOM Parser:

<?php

// Requires simple_html_dom.php from the PHP Simple HTML DOM Parser project.
include("simple_html_dom.php");

// ?ppg=all appears to return every product on one page, which avoids looping over ?pg=N.
$html = file_get_html('http://www.nrs.com/category/3101/whitewater-kayaking/helmets?ppg=all');

// Print the text content of every <h2> element on the page.
foreach ($html->find('h2') as $element) {
    echo $element->plaintext . "<br />";
}

/* OUTPUT
WRSI Trident Composite Helmet
WRSI Moment Fullface Helmet With Vents
WRSI Moment Fullface Helmet Without Vents
WRSI Current Pro Helmet
WRSI Current Helmet Without Vents
WRSI Current Helmet Without Vents
WRSI Current Helmet With Vents
WRSI Current Helmet With Vents
WRSI Current Rescue Helmet without Vents
WRSI Current Rescue Helmet with Vents
WRSI Limited Edition Current Helmet
NRS Chaos Helmet - Side Cut - Closeout
...
*/
?>
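
The question also asks for the names to be stored in an array, with duplicates removed and " - Closeout" stripped. A minimal sketch of that step follows. The helper name clean_product_names is hypothetical, and the sample input is taken from the output above; in practice the array would be filled with each $element->plaintext instead of echoing it.

```php
<?php
/**
 * Strip a trailing " - Closeout" marker from each name and drop duplicates.
 * (Hypothetical helper, not part of any library.)
 *
 * @param string[] $names Raw product names, e.g. the <h2> plaintext values.
 * @return string[] Cleaned, unique names, re-indexed from 0.
 */
function clean_product_names(array $names): array
{
    $cleaned = [];
    foreach ($names as $name) {
        // Remove the marker only when it ends the name; the "i" flag ignores case.
        $name = preg_replace('/\s*-\s*Closeout\s*$/i', '', trim($name));
        if ($name !== '') {
            $cleaned[] = $name;
        }
    }
    // array_unique keeps the first occurrence; array_values re-indexes the result.
    return array_values(array_unique($cleaned));
}

$products = clean_product_names([
    'WRSI Current Helmet Without Vents',
    'WRSI Current Helmet Without Vents',
    'NRS Chaos Helmet - Side Cut - Closeout',
]);
print_r($products);
```

With this sample input, $products holds two entries: "WRSI Current Helmet Without Vents" and "NRS Chaos Helmet - Side Cut".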

Home page: http://simplehtmldom.sourceforge.net/
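
If the site does not offer an "all on one page" option, the ?pg=1, ?pg=2, ... loop from the question can be sketched as below. This is only an illustration under stated assumptions: it swaps in PHP's bundled DOM extension (DOMDocument) instead of Simple HTML DOM so it runs without extra files, the function names extract_h2_texts and collect_paginated_names are hypothetical, and the $fetch callback is a made-up hook so the loop can be demonstrated offline with canned pages on an example.test URL.

```php
<?php
/**
 * Return the trimmed text of every <h2> in an HTML string.
 * Uses PHP's DOM extension (an assumption: ext-dom must be enabled).
 */
function extract_h2_texts(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('h2') as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

/**
 * Walk ?pg=1, ?pg=2, ... until a page fails, has no <h2>, or repeats an earlier page.
 *
 * @param callable $fetch    Hypothetical hook: takes a URL, returns the HTML string or false.
 * @param int      $maxPages Safety cap so a misbehaving site cannot cause an endless loop.
 */
function collect_paginated_names(string $baseUrl, callable $fetch, int $maxPages = 50): array
{
    $all = [];
    $seenPages = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch($baseUrl . '?pg=' . $page);
        if ($html === false) {
            break; // request error
        }
        $names = extract_h2_texts($html);
        $fingerprint = md5(implode("\n", $names));
        if ($names === [] || isset($seenPages[$fingerprint])) {
            break; // empty page, or same headings as a page already seen
        }
        $seenPages[$fingerprint] = true;
        $all = array_merge($all, $names);
    }
    // Drop names that appear on more than one page.
    return array_values(array_unique($all));
}

// Offline demonstration with canned pages instead of real HTTP requests.
$fakePages = [
    'http://example.test/helmets?pg=1' => '<h2><a href="#">Helmet A</a></h2><h2>Helmet B</h2>',
    'http://example.test/helmets?pg=2' => '<h2>Helmet B</h2><h2>Helmet C</h2>',
    'http://example.test/helmets?pg=3' => '<h2>Helmet B</h2><h2>Helmet C</h2>', // repeat of page 2
];
$fetchFake = function (string $url) use ($fakePages) {
    return $fakePages[$url] ?? false;
};
$names = collect_paginated_names('http://example.test/helmets', $fetchFake);
print_r($names);
```

For real requests, $fetch could wrap cURL with CURLOPT_RETURNTRANSFER set, so that curl_exec returns the page body as a string instead of writing it to a file. Headings that are not products, such as the "Follow Us:" entry the question's code trims off, would still need to be filtered out.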
