Parsing HTML / JS code with PHP to extract information.
www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626
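Take a look at this page; it is a children's clothing store. This is one of their items, and I want to point out the size section. What we need to do is get all the sizes for this item and check whether each size is available. Right now, all the sizes for this item are: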
3-4 years
4-5 years
5-6 years
7-8 years
How can you tell whether a size is available?
Now take a look at this page and check the sizes again:
www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751
This item has the following sizes:
12 months
18 months - Not Available
24 months
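As you can see, the 18 months size is not available; this is indicated by the "Not Available" text next to the size.
What we need to do is go to an item's page, get the sizes, and check the availability of each size. How can I do this with PHP?
EDIT:
Added working code and a new problem to solve.
Working code, but it needs more work: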
<?php
function getProductVariations($url) {
    //Use cURL to get the raw HTML for the page
    $ch = curl_init();
    curl_setopt_array($ch,
        array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_URL => $url
        )
    );
    $raw_html = curl_exec($ch);
    //If we get an invalid response back from the server, fail
    if ($raw_html === false) {
        //Read the error and release the handle before throwing
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception($error);
    }
    curl_close($ch);
    //Find the variation JS declarations and extract them
    $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/", $raw_html, $raw_matches);
    //We are done with the raw HTML now
    unset($raw_html);
    //Check that we got some results back
    if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1]) == $raw_variations && $raw_variations > 0) {
        //This is where the matches will go
        $matches = array();
        //Go through the results of the bracketed expression and convert them to a PHP assoc array
        foreach ($raw_matches[1] as $match) {
            //As they are declared in JavaScript we can use json_decode to process them nicely, they just need wrapping
            $proc = json_decode("[$match]");
            //Skip any declaration that does not decode to the 12 expected fields
            if (!is_array($proc) || count($proc) < 12) {
                continue;
            }
            //Label the fields as best we can
            $proc2 = array(
                "variation_id" => $proc[0],
                "size_desc" => $proc[1],
                "colour_desc" => $proc[2],
                "available" => (trim(strtolower($proc[3])) == "true"),
                "unknown_col1" => $proc[4],
                "price" => $proc[5],
                "unknown_col2" => $proc[6], /*Always seems to be zero*/
                "currency" => $proc[7],
                "unknown_col3" => $proc[8],
                "unknown_col4" => $proc[9], /*Negative price*/
                "unknown_col5" => $proc[10], /*Always seems to be zero*/
                "unknown_col6" => $proc[11] /*Always seems to be zero*/
            );
            //Push the processed variation onto the results array
            $matches[$proc[0]] = $proc2;
            //We are done with our proc2 array now (proc is overwritten on the next iteration)
            unset($proc2);
        }
        //Return the matches we have found
        return $matches;
    } else {
        throw new Exception("Unable to find any product variations");
    }
}
//EXAMPLE USAGE
try {
    $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
    //Do something more useful here
    print_r($variations);
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>
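The code above works, but it runs into a problem when a product requires a colour to be selected before its sizes are displayed.
Any idea how to solve this?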
Answer 0 (score: 3)
Solution:
function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    //Close the handle before returning; a statement placed after return never runs
    curl_close($ch);
    return $result;
}
$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');
preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is', $html, $bingo);
//print_r() already prints the array; wrapping it in echo would append a stray "1"
print_r($bingo);
Link: http://debconf11.com/stackoverflow.php
You're on your own from here :)
EDIT2:
OK, we're getting close to a solution...
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
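The sizes are not loaded via AJAX; the array is a plain JavaScript variable in the page source. You can parse it with PHP, and you can clearly see that the fourth field for 18 months is "False", which means that size is not available.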
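As a rough illustration of that idea, here is a minimal sketch that reads the availability flag out of the script block shown above. The function name `parseSizeAvailability` and the `$sample` variable are made up for this example, and it assumes the fourth array field is the availability flag, as the sample data suggests. It also assumes the text is UTF-8, since `json_decode()` rejects other encodings.

```php
<?php
// Sample copied from the script block above; in practice this would be the fetched page HTML.
$sample = <<<'JS'
var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
JS;

// Hypothetical helper: returns a map of size description => available (bool).
function parseSizeAvailability(string $html): array
{
    $sizes = [];
    // Capture the argument list of every "arrSzeCol_...[n] = new Array(...);" line.
    $pattern = '/arrSzeCol_\w+\[\d+\]\s*=\s*new Array\((.*)\);/';
    if (preg_match_all($pattern, $html, $m)) {
        foreach ($m[1] as $args) {
            // The arguments are numbers and double-quoted strings, so wrapping
            // them in brackets gives a JSON array that json_decode can read.
            $row = json_decode("[$args]");
            if (!is_array($row) || count($row) < 4) {
                continue; // skip anything that does not look like a variation
            }
            // Assumption: index 1 is the size label, index 3 is "True"/"False".
            $sizes[$row[1]] = (strtolower(trim($row[3])) === 'true');
        }
    }
    return $sizes;
}

foreach (parseSizeAvailability($sample) as $size => $available) {
    echo $size . ': ' . ($available ? 'available' : 'not available') . "\n";
}
```

Note that keying the result by size label means a product listed in several colours would overwrite earlier rows with the same size; in that case the colour field (index 2) would need to be part of the key.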
EDIT:
The sizes are loaded via JavaScript, so you can't parse them out of the HTML because they aren't there. This is all I could extract...
<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
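You could sniff the JS to check whether the sizes can be loaded based on the product ID.
First, you need: http://simplehtmldom.sourceforge.net/ Forget file_get_contents(); it is ~5 times slower than cURL.
Then parse this code (the HTML element with the id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize):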
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
Then you can use preg_match(), explode(), str_replace() and similar functions to filter out the values you want. I could write it, but I don't have time right now :)
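For what it's worth, a minimal sketch of that filtering step might look like the following. It assumes you already have the populated option markup shown above (which, as noted earlier, is filled in by JavaScript and may not be present in the raw page source). The function name `parseSizeOptions` and the `$selectHtml` variable are made up for this example.

```php
<?php
// Option markup copied from the populated <select> shown above.
$selectHtml = '<option value="-1">Select Size</option>'
    . '<option value="1164">12 months</option>'
    . '<option value="1165">18 months - Not Available</option>'
    . '<option value="1167">24 months</option>';

// Hypothetical helper: turns the <option> elements into a list of sizes.
function parseSizeOptions(string $selectHtml): array
{
    $suffix = ' - Not Available';
    $sizes = [];
    // Grab the value attribute and the label of every option.
    preg_match_all('/<option value="(-?\d+)">(.*?)<\/option>/i', $selectHtml, $m, PREG_SET_ORDER);
    foreach ($m as $match) {
        if ($match[1] === '-1') {
            continue; // "Select Size" placeholder, not a real size
        }
        $sizes[] = [
            'variation_id' => (int) $match[1],
            // Strip the marker text so only the size itself remains.
            'size'         => trim(str_replace($suffix, '', $match[2])),
            // A size counts as available when the marker text is absent.
            'available'    => (strpos($match[2], $suffix) === false),
        ];
    }
    return $sizes;
}

print_r(parseSizeOptions($selectHtml));
```

A regular expression is good enough for a fixed snippet like this one, but it will break if the site changes the attribute order or adds attributes to the option elements.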
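Answer 1 (score: 1)
The simplest way to get the contents of a URL is to rely on the fopen wrappers and use file_get_contents with the URL. You can then use the tidy extension to parse the HTML and extract the content. http://php.net/tidy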
Answer 2 (score: 1)
You can download the file with fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension may be easier to use than Tidy.
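I know the DOM extension is enabled by default in PHP, but I'm not sure whether Tidy is (the manual page only says it is "bundled", so I suspect it may not be enabled).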
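To give an idea of what the DOM extension route could look like, here is a small sketch using DOMDocument and DOMXPath on the select element quoted in the other answer. The function name `listSizeOptions` is made up for this example, and the markup is a trimmed copy of that snippet rather than a live page.

```php
<?php
// Trimmed copy of the populated <select> quoted in the other answer.
$html = '<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">'
    . '<option value="-1">Select Size</option>'
    . '<option value="1164">12 months</option>'
    . '<option value="1165">18 months - Not Available</option>'
    . '<option value="1167">24 months</option>'
    . '</select>';

// Hypothetical helper: returns option value => option text for one <select>.
function listSizeOptions(string $html, string $selectId): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $options = [];
    foreach ($xpath->query("//select[@id='$selectId']/option") as $option) {
        // Numeric value strings become integer array keys in PHP.
        $options[$option->getAttribute('value')] = trim($option->textContent);
    }
    return $options;
}

print_r(listSizeOptions($html, 'ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize'));

// Quick way to check whether Tidy is enabled on a given installation.
echo 'tidy loaded: ' . (extension_loaded('tidy') ? 'yes' : 'no') . "\n";
```

The XPath query plays the same role as getElementById plus a loop over child option nodes would in browser JavaScript.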