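I have this script.
What I need is for it to write only the links that contain "/product-product/" to the file items.txt. Actually, not the whole link, just the 10-digit item number that follows product-product/, for example 1007687980.
In the example the item number starts with /100, because I was searching a category whose item numbers all begin with 100. That restriction is no longer needed.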
$keyword= $_SERVER['QUERY_STRING'];
$site=1;
while ($site<30) {
$content = file_get_contents('http://www.example.com/?keywords='. $keyword .'&x=0&y=0&pagecount='.$site.'&sort=sort');
$html = $content;
$dom = new DomDocument();
@$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
$lookfor='http://www.example.com';
foreach ($urls as $url){
if(substr($url->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$tubeurl = str_replace ("http://www.example.com","",$url->getAttribute('href'));
$tubeurl = substr($tubeurl, strpos($tubeurl,"/product-product/100")+17, 10);
file_put_contents("items.txt", "" .$tubeurl. "
", FILE_APPEND | LOCK_EX);// this line must remain, it makes it so that there is a new line \n wouldn't work
}
}
$site++;
echo $site;
}
A regular expression would be one solution, but I have read here on Stack Overflow that regexes are a lot of work for the server.
Answer 0 (score: 0)
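A simple regex that captures the product ID in its first group ($1) should do the trick. In PHP, preg_match() stores that group in $matches[1]. You may want some extra logic to confirm that the match actually succeeded before using it. Edit: modified it so that the captured ID is always exactly 10 digits.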
$keyword= $_SERVER['QUERY_STRING'];
$site=1;
while ($site<30) {
$content = file_get_contents('http://www.domain.com/?keywords=' . $keyword . '&x=0&y=0&pagecount='.$site.'&sort=sort');
$html = $content;
$dom = new DomDocument();
@$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
$lookfor='http://www.domain.com';
foreach ($urls as $url){
if(substr($url->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$tubeurl = str_replace ("http://www.domain.com","",$url->getAttribute('href'));
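// Capture the 10-digit item number that follows "/product-product/".
// '#' is used as the delimiter so the slashes in the path need no escaping.
if (preg_match('#/product-product/(\d{10})(?!\d)#', $tubeurl, $matches)) {
file_put_contents("items.txt", $matches[1] . PHP_EOL,
FILE_APPEND | LOCK_EX); // PHP_EOL writes the line break after each item number
}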
}
}
$site++;
echo $site;
}
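To try the pattern on its own, without fetching any pages, the extraction step can be pulled into a small function. This is only a sketch: extractItemNumber() is a hypothetical helper name, the sample links are made up for illustration, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the 10-digit item number found after
// "/product-product/" in a link, or null if the link has none.
function extractItemNumber(string $href): ?string
{
    // '#' as the delimiter avoids escaping the slashes in the path.
    // (?!\d) rejects numbers that are longer than 10 digits.
    if (preg_match('#/product-product/(\d{10})(?!\d)#', $href, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Made-up sample links, for illustration only.
$links = [
    'http://www.example.com/product-product/1007687980',
    'http://www.example.com/about',
    'http://www.example.com/category/product-product/2001234567?ref=x',
];

$items = [];
foreach ($links as $link) {
    $item = extractItemNumber($link);
    if ($item !== null) {
        $items[] = $item; // keep only links that carry an item number
    }
}

echo implode(PHP_EOL, $items), PHP_EOL;
```

Inside the loop from the answer, the same call would replace the inline preg_match(), with file_put_contents() running only when the result is not null. As for the cost concern, a single anchored pattern like this runs once per link, which is likely small next to the HTTP request and HTML parsing done for each page.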