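Below is my code; it does not produce the expected result.
It should first use cURL to fetch the full HTML content of the page and then apply a regexp. When I assign the HTML directly to $htmlContent, the regexp gives the expected result, but when the content comes from cURL it does not.
For example, when I assign the content below to the $htmlContent variable, the regexp gives the correct result.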
$htmlContent = '<table id="ctl00_pageContent_ctl00_productList" class="product-list" cellspacing="0" border="0" style="width:100%;border-collapse:collapse;">
<tr>
<td class="product-list-item-container" style="width:100%;">
<div class="product-list-item" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_pageContent_ctl00_productList_ctl00_imbAdd')">
<a href="/W10542314D/WDoorGasketandLatchSt.aspx">
<img class="product-list-img" src="/images/products/display/applianceparts.jpg" title="W10542314 D/W Door Gasket & Latch St " alt="W10542314 D/W Door Gasket & Latch St " border="0" />
</a>
<div class="product-list-options">
<h5><a href="/W10542314D/WDoorGasketandLatchSt.aspx">W10542314 D/W Door Gasket & Latch St</a></h5>
<div class="product-list-cost"><span class="product-list-cost-label">Online Price:</span> <span class="product-list-cost-value">$33.42</span></div>
</div>
';
Here is my complete code:
<?php
$url = "http://www.universalapplianceparts.com/search.aspx?find=W10130694";
$ch1= curl_init();
curl_setopt ($ch1, CURLOPT_URL, $url );
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch1,CURLOPT_VERBOSE,1);
curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)');
curl_setopt ($ch1, CURLOPT_REFERER,'http://www.google.com'); //just a fake referer
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch1,CURLOPT_POST,0);
curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, 20);
$htmlContent= curl_exec($ch1);
echo $htmlContent;
$value=preg_match_all('/.*<div.*class=\"product\-list\-options\".*>.*<a href="(.*)">.*<\/a>.*<\/div>/s',$htmlContent,$matches);
print_r($matches);
$value=preg_match_all('/.*<div.*class=\"product\-list\-item\".*>.*<a href=\"(.*)\">.*<img.*>.*<\/div>/s',$htmlContent,$matches);
print_r($matches);
This code echoes the HTML content of the page; the regexp should then return the href of the anchor tags inside the divs with the class names product-list-options and product-list-item.
Current output:
http://www.universalapplianceparts.com/termsofservice.aspx
Expected output in the array value: /W10130694LatchAssyWhiteHandle.aspx
Any help would be appreciated.
Thanks
Answer 0 (score: 2)
Try this:
class="product-list-item".*?<a href="(.*?)".*?class="product-list-options"
Output:
MATCH 1
1. [23040-23075] `/W10130694LatchAssyWhiteHandle.aspx`
Explanation:
Match class="product-list-item"
class="product-list-item"
Match any character, as few as possible
.*?
Match <a href="
<a href="
Capture the text inside href="(.*?)"
href=""
Match class="product-list-options"
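
To show how the lazy pattern above could be used from PHP, here is a minimal sketch. The $htmlContent string is a shortened, hypothetical stand-in for the page that cURL returns (the footer link in particular is an assumption made for illustration), and the `s` modifier is added so that `.` also matches newlines. For comparison, it also runs the first greedy pattern from the question, which on this sample jumps to the last link in the document.

```php
<?php
// Hypothetical, shortened stand-in for the HTML returned by curl_exec().
$htmlContent = <<<'HTML'
<div class="product-list-item" onkeypress="...">
<a href="/W10130694LatchAssyWhiteHandle.aspx">
<img class="product-list-img" src="/images/products/display/applianceparts.jpg" border="0" />
</a>
<div class="product-list-options">
<h5><a href="/W10130694LatchAssyWhiteHandle.aspx">W10130694 Latch Assy</a></h5>
</div>
</div>
<div id="footer"><a href="http://www.universalapplianceparts.com/termsofservice.aspx">Terms</a></div>
HTML;

// Lazy pattern from the answer: every .*? stops at the first possible match.
$lazy = '/class="product-list-item".*?<a href="(.*?)".*?class="product-list-options"/s';
preg_match_all($lazy, $htmlContent, $matches);
$links = $matches[1];
print_r($links);

// Greedy pattern from the question: every .* runs as far as it can,
// so the captured href is the last one that still lets the pattern match.
$greedy = '/.*<div.*class=\"product\-list\-options\".*>.*<a href="(.*)">.*<\/a>.*<\/div>/s';
preg_match_all($greedy, $htmlContent, $greedyMatches);
$greedyLinks = $greedyMatches[1];
print_r($greedyLinks);
```

On this sample the lazy pattern captures the product link, while the greedy one captures the footer link, which is consistent with the termsofservice.aspx output reported in the question. The same call should work on the string returned by curl_exec(), provided the live page has the same structure as the snippet in the question.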