Unexpected output from XPath screen scraping

Date: 2014-04-09 13:02:01

Tags: php xpath screen-scraping

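Even though the XPath is correct (as far as I can tell), this code is producing strange output.

By that I mean that many of the was_price and now_price values are not being scraped from the page, so only &pound; is returned for them.

Any idea what is wrong?

Here's the site I am scraping.

Code: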

function scrape($list_url, $shop_name, $photo_location, $photo_url_root, $product_location, $product_url_root, $was_price_location, $now_price_location, $gender, $country, mysqli $con)
{

    $html = file_get_contents($list_url);
    $doc = new DOMDocument();
    libxml_use_internal_errors(TRUE);

    if(!empty($html))
    {
        $doc->loadHTML($html);
        libxml_clear_errors(); // remove errors for yucky html
        $xpath = new DOMXPath($doc);

        /* FIND LINK TO PRODUCT PAGE */

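        // Initialise every array that is appended to below, so that the
        // final foreach never reads an undefined variable.
        $product_urls = $photo_url = $was_price = $now_price = array();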

        $row = $xpath->query($product_location);

        /* Create an array containing products */
        if ($row->length > 0)
        {            
            foreach ($row as $location)
            {
                $product_urls[] = $product_url_root . $location->getAttribute('href');
            }
        }
        else { echo "product location is wrong<br>";}

        $imgs = $xpath->query($photo_location);

        /* Create an array containing the image links */
        if ($imgs->length > 0)
        {            
            foreach ($imgs as $img)
            {
                $photo_url[] = $photo_url_root . $img->getAttribute('src');
            }
        }
        else { echo "photo location is wrong<br>";}

        $was = $xpath->query($was_price_location);

        /* Create an array containing the was price */
        if ($was->length > 0)
        {
            foreach ($was as $price)
            {
                $stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
                $was_price[] = "&pound;".$stripped;
            }
        }
        else { echo "was price location is wrong<br>";}

        $now = $xpath->query($now_price_location);

        /* Create an array containing the sale price */
        if ($now->length > 0)
        {
            foreach ($now as $price)
            {
                $stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
                $now_price[] = "&pound;".$stripped;
            }
        }
        else { echo "now price location is wrong<br>";}

        $result = array();

        /* Create an associative array containing all the above values */
        foreach ($product_urls as $i => $product_url)
        {
            $result[] = array(
                'product_url' => $product_url,
                'shop_name' => $shop_name,
                'photo_url' => $photo_url[$i],
                'was_price' => $was_price[$i],
                'now_price' => $now_price[$i]
            );
        }

        echo json_encode($result);

    }
    else
    {
        echo "this is empty";
    }
}

$list_url = "http://www.asos.com/Women/Sale/70-Off-Sale/Cat/pgecategory.aspx?cid=16903&pge=0&pgesize=1002&sort=-1";
$shop_name = "ASOS";
$photo_location = "//ul[@id='items']/li/div[@class='categoryImageDiv']/*[1]/img";
$photo_url_root = "";
$product_location = "//ul[@id='items']/li/div[@class='categoryImageDiv']/*[1]";
$product_url_root = "http://www.asos.com";
$was_price_location = "//ul[@id='items']/li/div[@class='productprice']/span[@class='price' or @class='recRP rrp']"; // leave recRP rrp
$now_price_location = "//ul[@id='items']/li/div[@class='productprice']/span[@class='prevPrice previousprice' or @class='price outlet-current-price']"; // leave outlet-current-price
$gender = "f";
$country = "UK";

scrape($list_url, $shop_name, $photo_location, $photo_url_root, $product_location, $product_url_root, $was_price_location, $now_price_location, $gender, $country, $con);

1 Answer:

Answer 0 (score: 0)

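I counted the number of matches each expression produces on the site, and it looks like your was_price has 1563 hits while your now_price has only 1440. That tells me that either your XPath does not work in 100% of cases, or some items have only one price.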
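A quick way to reproduce this kind of check is to run each expression and compare the lengths of the returned node lists. The sketch below is only an illustration: the inline markup is a made-up stand-in for the real listing page, and count_matches() is a hypothetical helper, not part of the code in the question. The two price expressions are copied from the question.

```php
<?php
// Hypothetical sample markup standing in for the real listing page.
// The second item deliberately has only one price.
$html = <<<HTML
<ul id="items">
<li><div class="productprice"><span class="price">&pound;20.00</span><span class="prevPrice previousprice">&pound;6.00</span></div></li>
<li><div class="productprice"><span class="price">&pound;15.00</span></div></li>
</ul>
HTML;

// Hypothetical helper: returns the number of nodes matched by each expression.
function count_matches(string $html, array $expressions): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $counts = array();
    foreach ($expressions as $name => $expression) {
        $counts[$name] = $xpath->query($expression)->length;
    }
    return $counts;
}

$counts = count_matches($html, array(
    'items'     => "//ul[@id='items']/li",
    'was_price' => "//ul[@id='items']/li/div[@class='productprice']/span[@class='price' or @class='recRP rrp']",
    'now_price' => "//ul[@id='items']/li/div[@class='productprice']/span[@class='prevPrice previousprice' or @class='price outlet-current-price']",
));

// If the counts differ, index-based merging of the arrays will misalign.
print_r($counts);
```

If the counts differ, the values at index $i in the separate arrays no longer belong to the same product.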

So you have to make sure that all of your XPath expressions return the same number of results, so that: products = new_price = old_price = images
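One way to keep the values aligned, even when an item is missing a price, is to loop over the item nodes and run the price expressions relative to each item instead of against the whole document. This is a minimal sketch under assumptions: the inline markup is invented for illustration, extract_items() and clean_price() are hypothetical names, and the class names are simply reused from the question.

```php
<?php
// Hypothetical sample markup; the second item has no second price.
$html = <<<HTML
<ul id="items">
<li><div class="productprice"><span class="price">&pound;20.00</span><span class="prevPrice previousprice">&pound;6.00</span></div></li>
<li><div class="productprice"><span class="price">&pound;15.00</span></div></li>
</ul>
HTML;

// Hypothetical helper: first matched node as a price string, or null if absent.
function clean_price(DOMNodeList $nodes): ?string
{
    if ($nodes->length === 0) {
        return null;
    }
    $stripped = preg_replace('/[^0-9,.]/', '', $nodes->item(0)->nodeValue);
    return $stripped === '' ? null : '&pound;' . $stripped;
}

// Hypothetical helper: one result row per <li>, so values cannot drift apart.
function extract_items(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $result = array();
    foreach ($xpath->query("//ul[@id='items']/li") as $item) {
        // The leading "." and the second argument make the query relative to $item.
        $was = $xpath->query("./div[@class='productprice']/span[@class='price' or @class='recRP rrp']", $item);
        $now = $xpath->query("./div[@class='productprice']/span[@class='prevPrice previousprice' or @class='price outlet-current-price']", $item);

        $result[] = array(
            'was_price' => clean_price($was),
            'now_price' => clean_price($now),
        );
    }
    return $result;
}

$items = extract_items($html);
echo json_encode($items);
```

With this layout a missing price shows up as null for that one product rather than shifting every later price onto the wrong item. The product link and image could be read from the same $item node in the same loop.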