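I am writing a simple crawler script, as a learning exercise, to read all the links on a website. Something is wrong with my pattern, and I don't understand why it doesn't work.
In the site's source code, the links look like this: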
<a href="?ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22">Handlauf Holz </a>
My pattern and function call look like this:
preg_match_all( '/ObjectPath.*"/', $contentrow, $output, PREG_SET_ORDER );
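It works for the first part, but after that the output breaks. Here is a sample of the output, including the extra text that ends up in it: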
ObjectPath=/Shops/15456062/Categories">-GESAMTANGEBOT-Handläufe
ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22&amp;#ProductRatings"
ObjectPath=/Shops/15456062/Categories/CustomerInformation"
ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22&amp;ChangeAction=SelectSubProduct" method="post"
The part of the source code that this output comes from looks like this:
<a class="BreadcrumbItem" href="?ObjectPath=/Shops/345456456/Categories">-GESAMTANGEBOT-</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/1234346q/Categories/Handlauf">Handläufe</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22">Handläufe Edelstahl</a>
I don't understand why the part -GESAMTANGEBOT- is included in the match. Shouldn't the match end at the "?
Thanks!
Here is the complete script:
<?php
header('Content-Type: text/html; charset=utf-8');
function getPage($url){
    // Check whether cURL is installed
    if (!function_exists('curl_init')){
        die('Curl not initialed');
    }
    // Array with the cURL settings
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => false,
        CURLOPT_ENCODING => "",
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT => 120,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_MAXREDIRS => 10
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
$url = "http:/domain.com/epages/23455467.sf/de_DE/?ObjectPath=/Shops/15456062/Products/%22Handlauf%20Edelstahl%20DS01%22/SubProducts/%22Handlauf%20Edelstahl%20DS%2001%20014%22";
$domain = 'http://www.domain.com/epages/452563456.sf/de_DE/?';
$content = getPage($url);
$i=0;
foreach ($content as $contentrow) {
    //go through content and look for links
    if (preg_match_all( '/ObjectPath(.*)"/', $contentrow, $output, PREG_SET_ORDER )) {
        $i++;
        echo '<h1>'.$i.'</h1>';
        foreach ($output as $row) {
            $url= $domain.$row[0];
            //echo '<a href="'.$url.'">'.$url.'</a>';
            echo $url;
            echo '<br /><h2>onerow</h2><br />';
        }
    }
}
//print_r($content);
I forgot to mention that I get this warning above the output:
Warning: preg_match_all() expects parameter 2 to be string, array given in C:\xampp\htdocs\scripts\readratings.php on line 48
Answer 0 (score: 0)
Using:
$contentrow = '<a href="?ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22">Handlauf Holz </a>';
preg_match_all( '/ObjectPath(.*)"/', $contentrow, $output, PREG_SET_ORDER);
print_r($output);
Output:
Array
(
[0] => Array
(
[0] => ObjectPath=/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22"
[1] => =/Shops/154567062/Categories/Handlauf/%22Handlauf%20Holz%22
)
)
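The snippet above passes a single string to preg_match_all(). In the script from the question, $contentrow comes from looping over the whole array returned by getPage(), and some of the values that curl_getinfo() puts into that array are not strings, which is the likely source of the warning. Below is a minimal sketch, assuming the page body sits under the 'content' key as in the question. The $page array is a hypothetical stand-in so that the example runs without a network request, and the pattern is a variant that stops before the closing quote:

```php
<?php
// Hypothetical stand-in for the array returned by getPage() in the question.
$page = array(
    'url'       => 'http://www.domain.com/',
    'http_code' => 200,
    'certinfo'  => array(), // a nested array like this is not a valid subject string
    'content'   => '<a href="?ObjectPath=/Shops/15456062/Categories">-GESAMTANGEBOT-</a>',
);

// Search only the page body, which is a string, instead of every array element.
$links = array();
if (is_string($page['content'])
    && preg_match_all('/ObjectPath[^"]*/', $page['content'], $matches)) {
    $links = $matches[0]; // full matches, without the closing quote
}

print_r($links);
```

With this approach the foreach over $content is no longer needed; the pattern is applied once to the whole page body.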
Answer 1 (score: 0)
If I understand correctly, you have something like this:
<a class="BreadcrumbItem" href="?ObjectPath=/Shops/345456456/Categories">-GESAMTANGEBOT-</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/1234346q/Categories/Handlauf">Handläufe</a><a class="BreadcrumbItem" href="?ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22">Handläufe Edelstahl</a>
and you want all of these parts:
ObjectPath=/Shops/345456456/Categories
ObjectPath=/Shops/1234346q/Categories/Handlauf
ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22
Although I don't know why you get that strange output, you should be able to get what you want with a lazy quantifier. This should do it:
/ObjectPath(.*?)"/
because it stops at the first ". In this case it is equivalent to:
/ObjectPath([^"]*)"/
although the two are not equivalent in general.
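To make the difference concrete, here is a small sketch that runs the greedy pattern from the question and the two patterns above against the breadcrumb markup. The variable names ($html, $greedy, $lazy, $negated) are chosen for illustration only. The greedy .* runs to the end of the line and then backtracks to the last ", which would explain why link text such as -GESAMTANGEBOT- shows up in the output:

```php
<?php
// Breadcrumb markup from the question, all on one line.
$html = '<a class="BreadcrumbItem" href="?ObjectPath=/Shops/345456456/Categories">-GESAMTANGEBOT-</a>'
      . '<a class="BreadcrumbItem" href="?ObjectPath=/Shops/1234346q/Categories/Handlauf">Handläufe</a>'
      . '<a class="BreadcrumbItem" href="?ObjectPath=/Shops/15456062/Categories/Handlauf/%22Handlauf%20Edelstahl%22">Handläufe Edelstahl</a>';

// Greedy: .* consumes the rest of the line, then backtracks to the LAST quote.
preg_match_all('/ObjectPath(.*)"/', $html, $greedy);

// Lazy: .*? stops at the FIRST quote after each ObjectPath.
preg_match_all('/ObjectPath(.*?)"/', $html, $lazy);

// Negated class: [^"]* can never cross a quote.
preg_match_all('/ObjectPath([^"]*)"/', $html, $negated);

echo count($greedy[0]), "\n"; // one long match spanning all three links
echo count($lazy[0]), "\n";   // one match per link
print_r($lazy[1]);            // the captured paths, each starting with "="
```

On this input the lazy and negated-class patterns return the same three captures. They can differ elsewhere: by default . does not match a newline while [^"] does, and a lazy quantifier may still run past a quote if the rest of a longer pattern cannot match otherwise.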