How to extract data from another website using curl in PHP

Time: 2016-04-23 06:19:26

Tags: php jquery html curl web-scraping

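I want to fetch data from the Amazon best deals URL and display only the products section rather than the whole page (i.e. without the header and sidebar), limited to 8 products. I am using curl and Simple HTML DOM in PHP.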

<?php
include_once("php/simple_html_dom.php");

// use curl to get the HTML content
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with the given URL
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // forward the visitor's user agent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // fail on HTTP status codes >= 400
    $html = curl_exec($ch); // false on failure
    curl_close($ch);
    return $html;
}

echo $html = getHTML("http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals", 10);

?>

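The problem is that it pulls all of the content, whereas I only want the product div section. This is Amazon's div container for a single product: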

<div id="100_dealView_0" class="a-section a-spacing-none tallCellView gridColumn4 singleCell">

        <div class="a-section dealContainer">

    <div class="a-section backGround layer">
    </div>

    <div class="a-section layer">

            <div class="a-row dealContainer dealTile">


        <a id="dealImage" class="a-link-normal" href="https://www.amazon.in/s/ref=gbps_img_s-4_0227_af8a024a?fst=as%3Aoff&amp;rh=n%3A1571283031%2Cn%3A1983396031%2Ck%3A23rdApril_runningshoes_dotdlist%2Cp_76%3A1318482031%2Cp_6%3AA14FG3FHN6HO9H&amp;keywords=23rdApril_runningshoes_dotdlist&amp;ie=UTF8&amp;qid=1460093112&amp;rnid=1318474031&amp;smid=A14FG3FHN6HO9H&amp;pf_rd_p=900470227&amp;pf_rd_s=slot-4&amp;pf_rd_t=701&amp;pf_rd_i=gb_main&amp;pf_rd_m=A1VBAL9TL5WCBF&amp;pf_rd_r=13ED3AZVD21FX9VX9SS1">
            <div class="a-row a-spacing-base a-spacing-top-base imageBlock">
                <div class="a-row dealContainer">
                    <div class="a-row layer">
                        <img alt="" src="https://images-na.ssl-images-amazon.com/images/I/51%2BpumuEs%2BL._AA210_.jpg" data-a-hires="https://images-na.ssl-images-amazon.com/images/I/51%2BpumuEs%2BL._AA420_.jpg">
                    </div>
                    <div class="a-row layer backGround">
                    </div>
                </div>
            </div>
        </a>



                    <div class="a-row a-spacing-mini">


        <span class="a-size-mini a-color-base dotdBadge">DEAL OF THE DAY</span>

</div>

                <div class="a-row a-spacing-mini">

            <div class="a-row priceBlock unitLineHeight">
                <span class="a-size-medium a-color-base inlineBlock unitLineHeight">₹549 - ₹5,399</span>
            </div>

</div>
                <div class="a-row a-spacing-mini">

        <div class="a-row unitLineHeight">
            <span class="a-size-mini a-color-secondary inlineBlock unitLineHeight">
                Ends in
            </span>

            <span id="100_dealView_0_dealClock" class="a-size-mini a-color-secondary inlineBlock unitLineHeight">12:13:59</span>
        </div>

</div>
                <div class="a-row a-spacing-mini">

    <a class="a-link-normal" href="https://www.amazon.in/s/ref=gbps_tit_s-4_0227_af8a024a?fst=as%3Aoff&amp;rh=n%3A1571283031%2Cn%3A1983396031%2Ck%3A23rdApril_runningshoes_dotdlist%2Cp_76%3A1318482031%2Cp_6%3AA14FG3FHN6HO9H&amp;keywords=23rdApril_runningshoes_dotdlist&amp;ie=UTF8&amp;qid=1460093112&amp;rnid=1318474031&amp;smid=A14FG3FHN6HO9H&amp;pf_rd_p=900470227&amp;pf_rd_s=slot-4&amp;pf_rd_t=701&amp;pf_rd_i=gb_main&amp;pf_rd_m=A1VBAL9TL5WCBF&amp;pf_rd_r=13ED3AZVD21FX9VX9SS1">
        <span class="a-declarative" data-action="gbdeal-actionrecord" data-gbdeal-actionrecord="{&quot;actionType&quot;:&quot;TITLE&quot;,&quot;position&quot;:&quot;0&quot;,&quot;widgetID&quot;:&quot;100&quot;,&quot;dealID&quot;:&quot;af8a024a&quot;}">

            <span id="dealTitle" class="a-size-base a-color-base dealTitleTwoLine hoverVisible visibleCss singleCellTitle autoHeight" style="width: 210px;">
                Men's Shoes: Minimum 40% Off for Sports Shoes
            </span>
            <span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">
                Men's Shoes: Minimum 40% Off for Sports Shoes
            </span>

        </span>
    </a>

</div>

                    <div class="a-row a-spacing-mini">

        <div class="a-row reviewStars">
            <a class="a-link-normal touchAnchor" href="/gp/product-reviews/B00593XQS6/ref=gbps_rvw_s-4_0227_af8a024a?pf_rd_p=900470227&amp;pf_rd_s=slot-4&amp;pf_rd_t=701&amp;pf_rd_i=gb_main&amp;pf_rd_m=A1VBAL9TL5WCBF&amp;pf_rd_r=13ED3AZVD21FX9VX9SS1">
                <span class="a-declarative" data-action="gbdeal-actionrecord" data-gbdeal-actionrecord="{&quot;actionType&quot;:&quot;REVIEWS&quot;,&quot;position&quot;:&quot;0&quot;,&quot;widgetID&quot;:&quot;100&quot;,&quot;dealID&quot;:&quot;af8a024a&quot;}">

                            <i class="a-icon a-icon-star a-star-5"><span class="a-icon-alt">Avg. Customer Review</span></i>

                    1
            </span>
        </a>

</div>

                            <div class="a-row buttonOuterContainer ">


    <div class="a-row a-spacing-medium">

                        <span class="a-declarative" data-action="gbdeal-actionrecord" data-gbdeal-actionrecord="{&quot;actionType&quot;:&quot;SEE_MORE&quot;,&quot;position&quot;:&quot;0&quot;,&quot;widgetID&quot;:&quot;100&quot;,&quot;dealID&quot;:&quot;af8a024a&quot;}">
                            <span class="a-button a-button-span12 a-button-primary fixedWidth210"><span class="a-button-inner"><a href="https://www.amazon.in/s/ref=gbps_ulm_s-4_0227_af8a024a?fst=as%3Aoff&amp;rh=n%3A1571283031%2Cn%3A1983396031%2Ck%3A23rdApril_runningshoes_dotdlist%2Cp_76%3A1318482031%2Cp_6%3AA14FG3FHN6HO9H&amp;keywords=23rdApril_runningshoes_dotdlist&amp;ie=UTF8&amp;qid=1460093112&amp;rnid=1318474031&amp;smid=A14FG3FHN6HO9H&amp;pf_rd_p=900470227&amp;pf_rd_s=slot-4&amp;pf_rd_t=701&amp;pf_rd_i=gb_main&amp;pf_rd_m=A1VBAL9TL5WCBF&amp;pf_rd_r=13ED3AZVD21FX9VX9SS1" class="a-button-text a-text-center" role="button">
                                View Deal
                            </a></span></span>
                        </span>

    </div>


                            </div>
            </div>

    </div>
</div>

</div></div> 

There are more than 60 such divs, but I want only the first 8, with their contents extracted into the corresponding fields.

1 answer:

Answer 0: (score: 1)

You can use XPath. Take a look at this tutorial on scraping the web in PHP. You didn't include the whole HTML here, but I'm guessing that you want to capture the first div.

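// HTML fetched with the getHTML() function from the question
$output = getHTML("http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals", 10);

libxml_use_internal_errors(true); // suppress warnings about malformed HTML

$document = new DOMDocument;
$document->loadHTML($output);

$xpath = new DOMXPath($document);

// select the div with the id of the first deal
$data = $xpath->query("//div[@id='100_dealView_0']");

foreach ($data as $d) { // in case there are multiple (there shouldn't be)
    echo $d->nodeValue; // text content of the div, without the markup
}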
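
To get the first 8 deals as separate fields rather than one block of text, the same XPath approach can be extended. The following is only a rough sketch and rests on several assumptions: the function name `extractDeals` is made up for illustration, the selectors are guessed from the single container shown in the question (ids of the form `100_dealView_0`, a `dealTitle` span, a `priceBlock` div, a `dealImage` link), and the deal markup must actually be present in the HTML that curl receives. If the page builds the tiles with JavaScript, they will not be in the response and nothing will match.

```php
<?php
// Hypothetical helper (not from the original post): returns up to $limit deals,
// each as an array with title, price, link and image.
function extractDeals(string $html, int $limit = 8): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings silently
    $document = new DOMDocument();
    // The XML prolog is a common workaround that makes libxml read the markup as UTF-8.
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($document);

    // Assumption: every deal container is a div whose id looks like "100_dealView_0".
    // The position() predicate keeps only the first $limit matches in document order.
    $tiles = $xpath->query(
        sprintf("(//div[contains(@id, '_dealView_')])[position() <= %d]", $limit)
    );

    $deals = [];
    foreach ($tiles as $tile) {
        // Each query is relative to the current tile (note the leading ".").
        $title = $xpath->query(".//span[@id='dealTitle']", $tile)->item(0);
        $price = $xpath->query(".//div[contains(@class, 'priceBlock')]/span", $tile)->item(0);
        $link  = $xpath->query(".//a[@id='dealImage']", $tile)->item(0);
        $image = $xpath->query(".//a[@id='dealImage']//img", $tile)->item(0);

        $deals[] = [
            'title' => $title ? trim($title->textContent) : null,
            'price' => $price ? trim($price->textContent) : null,
            'link'  => $link ? $link->getAttribute('href') : null,
            'image' => $image ? $image->getAttribute('src') : null,
        ];
    }
    return $deals;
}

// Usage with a tiny inline sample instead of a live request:
$sampleHtml = '<div id="100_dealView_0"><a id="dealImage" href="https://example.com/deal-a">'
    . '<img src="https://example.com/a.jpg"></a>'
    . '<div class="a-row priceBlock"><span>549</span></div>'
    . '<span id="dealTitle"> Deal A </span></div>'
    . '<div id="100_dealView_1"><span id="dealTitle">Deal B</span></div>';

foreach (extractDeals($sampleHtml, 8) as $deal) {
    echo $deal['title'], ' | ', $deal['price'], PHP_EOL;
}
```

With the real page you would pass the string returned by `getHTML()` instead of `$sampleHtml`. Fields that cannot be found come back as `null`, so the caller can decide whether to skip incomplete tiles.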