How can I parse an HTML page with PHP?

Date: 2010-08-21 13:21:13

Tags: php html parsing html-parsing

I need to parse HTML/JS code with PHP to extract information.

www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626

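Take a look at this page; it is a children's clothing store. This is one of their items, and I want to point out the size section. What we need to do is get all the sizes for this item and check whether each size is available. Right now, all the sizes for this item are: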

3-4 years
4-5 years
5-6 years
7-8 years

How can you tell whether a size is available?

Now take a look at this other page and check its sizes:

www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751

This item has the following sizes:

12 months
18 months - Not Available
24 months

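As you can see, the 18 months size is not available, which is indicated by the "Not Available" text next to the size.

What we need to do is go to an item's page, get its sizes, and check the availability of each size. How can I do this with PHP?

EDIT:

Added working code and a new problem to solve.

Working code that still needs more work: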

<?php

function getProductVariations($url) {

  //Use CURL to get the raw HTML for the page
  $ch = curl_init();
  curl_setopt_array($ch,
    array(
      CURLOPT_RETURNTRANSFER=>true,
      CURLOPT_HEADER => false,
      CURLOPT_URL => $url
    )
  );
  $raw_html = curl_exec($ch);

  //If we get an invalid response back from the server, fail
  if ($raw_html===false) {
    //Read the error and release the handle before throwing
    $error = curl_error($ch);
    curl_close($ch);
    throw new Exception($error);
  }

  curl_close($ch);

  //Find the variation JS declarations and extract them
  $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);

  //We are done with the Raw HTML now
  unset($raw_html);

  //Check that we got some results back
  if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {

    //This is where the matches will go
    $matches = array();

    //Go through the results of the bracketed expression and convert them to a PHP assoc array
    foreach($raw_matches[1] as $match) {

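      //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
      $proc=json_decode("[$match]");

      //Skip any declaration that does not decode to the 12 expected fields
      if (!is_array($proc) || count($proc)<12) {
        continue;
      }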

      //Label the fields as best we can
      $proc2=array(
        "variation_id"=>$proc[0],
        "size_desc"=>$proc[1],
        "colour_desc"=>$proc[2],
        "available"=>(trim(strtolower($proc[3]))=="true"),
        "unknown_col1"=>$proc[4],
        "price"=>$proc[5],
        "unknown_col2"=>$proc[6],       /*Always seems to be zero*/
        "currency"=>$proc[7],
        "unknown_col3"=>$proc[8],
        "unknown_col4"=>$proc[9],       /*Negative price*/
        "unknown_col5"=>$proc[10],      /*Always seems to be zero*/
        "unknown_col6"=>$proc[11]       /*Always seems to be zero*/
      );

      //Push the processed variation onto the results array
      $matches[$proc[0]]=$proc2;

      //We are done with our proc2 array now (proc will be unset by the foreach loop)
      unset($proc2);
    }

    //Return the matches we have found
    return $matches;

  } else {
    throw new Exception("Unable to find any product variations");

  }
}


//EXAMPLE USAGE
try {
  $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");

  //Do something more useful here
  print_r($variations);


} catch(Exception $e) {
  echo "Error: " . $e->getMessage();
}

?>

The code above works, but it runs into a problem when a product requires a colour to be selected before its sizes are shown.

Like this one:

http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006

Any idea how to solve this?

3 answers:

Answer 0 (score: 3)

Solution:

    function curl($url){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
        //Close the handle before returning; code placed after a return never runs
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');

preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);

print_r($bingo);

Link: http://debconf11.com/stackoverflow.php

You're on your own now :)

EDIT2:

OK, we're getting close to a solution...

<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>

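The sizes are not loaded via Ajax; the array is a JavaScript variable in the page source. You can parse it with PHP, and you can clearly see that the 18 months entry is "False", which means it is not available.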
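As a rough sketch of parsing those declarations, the snippet below runs against the three sample lines quoted above rather than a live page. The helper name `sizesByColour` is hypothetical, and the field positions (size, colour, availability flag) are assumptions taken from the labels used in the question's own code. Because each entry carries a colour, grouping by it might also help with products that ask for a colour first, though that has not been checked against such a page.

```php
<?php
// Sample input copied from the <script> block quoted above.
$script = <<<'JS'
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
JS;

/**
 * Hypothetical helper: returns [colour => [size => bool available]].
 * Assumes field 1 is the size, field 2 the colour and field 3 the
 * availability flag, as the sample data suggests.
 */
function sizesByColour(string $js): array
{
    $result = [];
    // Capture the argument list of each "new Array(...)" declaration.
    preg_match_all('/\[\d+\]\s*=\s*new Array\((.*?)\);/', $js, $m);
    foreach ($m[1] as $args) {
        // The arguments are valid JSON once wrapped in brackets.
        $fields = json_decode("[$args]");
        if (!is_array($fields) || count($fields) < 4) {
            continue; // skip anything that does not decode as expected
        }
        $result[$fields[2]][$fields[1]] = (strtolower(trim($fields[3])) === 'true');
    }
    return $result;
}

$sizes = sizesByColour($script);
foreach ($sizes as $colour => $bySize) {
    foreach ($bySize as $size => $available) {
        echo $colour, ' / ', $size, ': ', $available ? 'available' : 'not available', PHP_EOL;
    }
}
```

Note that json_decode expects UTF-8 input, so a page served in another encoding would probably need converting first.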

EDIT:

These sizes are loaded via JavaScript, so you can't parse them from the HTML because they aren't there. This is all I could extract...

<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>

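You could sniff the JS to check whether the sizes can be loaded based on the product ID.

First you need: http://simplehtmldom.sourceforge.net/. Forget file_get_contents(); it is roughly 5x slower than cURL.

Then parse this code (the HTML element with the id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize):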

        <select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">

        <option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>

Then you can use preg_match(), explode(), str_replace() and similar functions to filter out the values you want. I could write it, but I don't have time right now :)
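One possible way to do that filtering is sketched below, using preg_match_all() and explode() on a shortened copy of the markup quoted above. It assumes the " - Not Available" suffix is the only availability marker and that the placeholder option always has the value -1; the variable names are illustrative only.

```php
<?php
// Sample markup based on the <select> element quoted above (attributes shortened).
$html = '<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">'
      . '<option value="-1">Select Size</option>'
      . '<option value="1164">12 months</option>'
      . '<option value="1165">18 months - Not Available</option>'
      . '<option value="1167">24 months</option></select>';

$options = [];
// Capture each option's value attribute and its label text.
preg_match_all('/<option value="(-?\d+)">(.*?)<\/option>/i', $html, $found, PREG_SET_ORDER);
foreach ($found as $option) {
    if ($option[1] === '-1') {
        continue; // the "Select Size" placeholder
    }
    // Split "18 months - Not Available" into the size and the optional suffix.
    $parts = explode(' - ', $option[2], 2);
    $options[$option[1]] = [
        'size'      => trim($parts[0]),
        'available' => !isset($parts[1]) || stripos($parts[1], 'Not Available') === false,
    ];
}
print_r($options);
```

Splitting on " - " with the surrounding spaces leaves sizes such as "3-4 years" intact.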

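Answer 1 (score: 1)

The simplest way to get the contents of a URL is to rely on the fopen wrappers and use file_get_contents() with the URL. You can then use the tidy extension to parse the HTML and extract the content. http://php.net/tidy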

Answer 2 (score: 1)

You can download the file with fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension may be easier to use than Tidy.
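A minimal sketch of the DOM extension approach, assuming the page contains the size dropdown quoted in the first answer. It runs on an inline sample string instead of a downloaded page, and the XPath query is just one way to select the options.

```php
<?php
// Sample markup based on the size dropdown quoted in the first answer.
$html = '<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">'
      . '<option value="-1">Select Size</option>'
      . '<option value="1164">12 months</option>'
      . '<option value="1165">18 months - Not Available</option>'
      . '<option value="1167">24 months</option></select>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select every option of the size dropdown except the "Select Size" placeholder.
$xpath = new DOMXPath($doc);
$query = '//select[@id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize"]/option[@value!="-1"]';

$labels = [];
foreach ($xpath->query($query) as $option) {
    $labels[$option->getAttribute('value')] = trim($option->textContent);
}
print_r($labels);
```

Anyone used to querying the DOM from JavaScript should find the node and attribute methods familiar.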

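I know the DOM extension is enabled by default in PHP, but I'm not sure whether Tidy is (the manual page only says it is "bundled", so I suspect it may not be enabled).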
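If in doubt, a quick check like the following sketch reports what a given installation actually has loaded. The list of extension names is just the set mentioned in this thread.

```php
<?php
// Report which of the extensions mentioned in this thread are loaded.
$status = [];
foreach (['dom', 'tidy', 'curl'] as $extension) {
    $status[$extension] = extension_loaded($extension);
    echo $extension, ': ', $status[$extension] ? 'loaded' : 'not loaded', PHP_EOL;
}
```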