PHP-based crawler / scraper for pages whose content is loaded by jQuery

Date: 2016-09-22 12:28:23

Tags: php jquery api curl web-scraping

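I am writing a script that loads the HTML pages of an e-commerce website. The script works fine, but it does not give me the HTML that jQuery adds to the page.

Is there any way to solve this?

I am using the code below.

At the moment I get the data from the meta tags.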

$url = "https://www.flipkart.com/pureit-classic-23-l-gravity-based-water-purifier/p/itmefqycwh7mgwan?pid=WAPEFQY5YEWY4ZGT&srno=s_1_5&otracker=search&lid=LSTWAPEFQY5YEWY4ZGTO514UM&qH=d5e3b0f34459bd7d";

$response = getPriceFromFlipkart($url);
echo '<pre>';
print_r($response);
echo '</pre>';
/* $response is a PHP array; use json_encode($response) if JSON output is needed */

function getPriceFromFlipkart($url) {

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
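    $html = curl_exec($curl);
    curl_close($curl);

    // curl_exec() returns false on failure (including HTTP errors, because of CURLOPT_FAILONERROR)
    if ($html === false) {
        return array("status" => "404", "error" => "We could not load the page $url");
    }

    $regexTitle = '/<meta name="og_title" property="og:title" content="([^"]*)"/';
    preg_match($regexTitle, $html, $title);
    $regexDesc = '/<meta name="Description" content="([^"]*)"/';
    preg_match($regexDesc, $html, $desc);
    // Only parse the price when the Description meta tag was actually found
    $price = isset($desc[1]) ? get_string_between($desc[1], 'Rs.', 'from') : '';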

    if ($price && $title) {
        $response = array("price" => "Rs. $price.00", "title" => $title[1], "Description" => $desc[1]);
    } else {
        $response = array("status" => "404", "error" => "We could not find the product details on Flipkart $url");
    }
    return $response;
}

function get_string_between($string, $start, $end) {
    // Return the trimmed text between $start and $end, or '' if either marker is missing
    $ini = strpos($string, $start);
    if ($ini === false)
        return '';
    $ini += strlen($start);
    $endPos = strpos($string, $end, $ini);
    if ($endPos === false)
        return '';
    return trim(substr($string, $ini, $endPos - $ini));
}
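
For comparison, the meta-tag lookup could probably also be done with PHP's DOM extension instead of regular expressions. This is only a minimal sketch: the helper name getMetaFromHtml and the sample markup are made up for illustration, and it assumes the same og:title and Description meta tags that the regular expressions above look for.

```php
<?php
// Hypothetical helper (not part of the original script): reads the same two
// meta tags with DOMDocument/DOMXPath instead of regular expressions.
function getMetaFromHtml($html) {
    $empty = array("title" => "", "Description" => "");
    // loadHTML() rejects empty input, so bail out early
    if (!is_string($html) || $html === '') {
        return $empty;
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) evaluates to '' when the tag or attribute is absent
    $title = $xpath->evaluate('string(//meta[@property="og:title"]/@content)');
    $desc  = $xpath->evaluate('string(//meta[@name="Description"]/@content)');

    return array("title" => $title, "Description" => $desc);
}

// Made-up sample markup, for illustration only
$sample = '<html><head>'
    . '<meta name="og_title" property="og:title" content="Sample Purifier">'
    . '<meta name="Description" content="Buy Sample Purifier for Rs.999 from Example">'
    . '</head><body></body></html>';

$meta = getMetaFromHtml($sample);
print_r($meta);
```

This only sees markup that is present in the HTML the server returns. cURL does not execute JavaScript, so anything jQuery inserts afterwards would still be missing from $html.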

0 answers:

No answers yet