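PHP cURL DOM: how to extract content together with its styles

Time: 2014-04-19 10:35:10

Tags: php jquery html css curl

The title may be unclear. What I am trying to do is copy everything inside a specific div of an existing web page (which I do not own). The code below already extracts the content successfully.

Extractor code: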

    // Get Data
    $curl_handle=curl_init();
    curl_setopt($curl_handle, CURLOPT_URL,'http://au.creative.com/p/speakers/creative-t4-wireless');
    curl_setopt($curl_handle, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4 );
    curl_setopt($curl_handle, CURLOPT_POST, false);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_HEADER, 0);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101');
    // Fetch the page through the configured handle. file_get_html() (Simple HTML DOM)
    // would ignore every option set above and return an object instead of a string.
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);


    // Display the required part
    $xml = new DOMDocument();
    @$xml->loadHTML($html);   // @ silences warnings about malformed HTML
    $xpath = new DOMXPath($xml);
    $info = $xpath->query('//div[@class="wrapper features-contents"]')->item(0);
    echo utf8_decode($xml->saveXML($info));
    // Escape the markup so the textarea shows it as source text
    echo '<textarea rows="500" cols="100">' . htmlspecialchars($xml->saveXML($info)) . '</textarea>';
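
A note on the XPath query above: `@class="wrapper features-contents"` only matches when the attribute is exactly that string, so a reordered or additional class would make the query return nothing. Below is a minimal sketch of a token-based match, assuming a small inline HTML sample in place of the fetched page; the sample markup and the `classToken` helper are hypothetical and not part of the original code.

```php
<?php
// Hypothetical stand-in for the fetched page; the real markup may differ.
$html = '<html><body>'
      . '<div class="features-contents wrapper extra">'
      . '<h3 class="feature-header">Pair and connect</h3>'
      . '</div>'
      . '</body></html>';

// Hypothetical helper: XPath 1.0 predicate that matches one class token,
// regardless of token order or surrounding whitespace.
function classToken(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of using @
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = '//div[' . classToken('wrapper') . ' and ' . classToken('features-contents') . ']';
$info  = $xpath->query($query)->item(0);

// saveHTML() serializes the node as HTML; item(0) is null when nothing matched.
$fragment = $info !== null ? $doc->saveHTML($info) : '';
echo $fragment;
```

Checking `item(0)` for `null` before serializing avoids dumping the whole document (or raising an error) when the page layout changes.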

Extracted markup:

    <h3 class="feature-header">Pair and connect in so many ways</h3>
    <div class="row product-info-row">
    <div class="span12">
    <div id="slides-modes-21677" style="position:relative;">
    <a id="arrow-left-21677" class="slidesjs-previous slidesjs-navigation" href="#">
    <img src="//d287ku8w5owj51.cloudfront.net/inline/products/21430/arrow_left.jpg" border="0" alt="<" width="42" height="54"/></a> <div id="slide1">
    <img style="margin:0 20px 0 20px;" src="//d287ku8w5owj51.cloudfront.net/inline/products/21677/bluetooth.jpg.ashx?width=520&height=383" alt="Freedom without compromise" width="520" height="383" align="right"/>

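Obviously only the class names were extracted, not the styles behind them. I remember that when you copy web page content from Chrome and paste it into Firefox, the CSS is converted into inline styles. Is it possible to do the same in PHP?

Part of the web page content I got in Firefox: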

    <h3 class="feature-header" style="font-size: 2.2857em; margin: 20px 0px 30px; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.4; text-transform: uppercase; color: #666666; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">PAIR AND CONNECT IN SO MANY WAYS</h3>
    <div class="row product-info-row" style="margin-bottom: 60px; margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span12" style="float: left; min-height: 1px; margin-left: 20px; width: 940px;">
    <div id="slides-modes-21677" style="position: relative; overflow: hidden;">
    <div class="slidesjs-container" style="overflow: hidden; position: relative; width: 940px; height: 383px;">
    <div class="slidesjs-control" style="position: relative; left: 0px; width: 940px; height: 383px;">
    <div id="slide1" class="slidesjs-slide" style="position: absolute; top: 0px; left: 0px; width: 940px; z-index: 10; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle; margin: 0px 20px;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/bluetooth.jpg.ashx?width=520&amp;height=383" alt="Freedom without compromise" width="520" height="383" align="right" />
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Freedom without compromise</h3>
    <p style="margin: 0px 0px 1em;"><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>wireless connectivity gives you the freedom and convenience to move around your room with your smart device as you're not tied down by any wires.<sup style="line-height: 0; position: relative; vertical-align: baseline; top: -0.5em;">1</sup><span class="Apple-converted-space">&nbsp;</span>And with aptX, you're assured of uncompromised audio quality.</p>
    </div>
    <div id="slide2" class="slidesjs-slide" style="position: absolute; top: 0px; left: 940px; width: 940px; z-index: 0; display: block; -webkit-backface-visibility: hidden;">
    <div style="margin: 0px 20px; vertical-align: middle; float: left;"><img id="fea_nfc_2" style="border: 0px; vertical-align: middle;" src="http://img.creative.com/inline/products/21677/fea_nfc_2.jpg" alt="" /></div>
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Just tap and pair</h3>
    <p style="margin: 0px 0px 1em;">With the NFC (Near Field Communication) receptor on the Audio Control Pod, you can simply tap your NFC-enabled device on it to pair and then you're all set to stream and enjoy your music.</p>
    </div>
    <div id="slide3" class="slidesjs-slide" style="position: absolute; top: 0px; left: -940px; width: 940px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle; margin: 0px 20px;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/multipoint.png.ashx?width=520&amp;height=383" alt="Stay connected" width="520" height="383" align="right" />
    <h3 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Stay connected</h3>
    <p style="margin: 0px 0px 1em;">Connect with multiple<span class="Apple-converted-space">&nbsp;</span><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>devices! With Creative Multipoint, you can have two<span class="Apple-converted-space">&nbsp;</span><em>Bluetooth</em><span class="Apple-converted-space">&nbsp;</span>stereo devices paired to the speakers at any one time and easily toggle between them.<sup style="line-height: 0; position: relative; vertical-align: baseline; top: -0.5em;">2</sup></p>
    </div>
    </div>
    </div>
    <a id="arrow-right-21677" class="slidesjs-next slidesjs-navigation" style="color: #0cbdef; text-decoration: none; cursor: pointer; display: block; overflow: hidden; position: absolute; top: 164.5px; z-index: 30; right: 0px;" href="http://au.creative.com/p/speakers/creative-t4-wireless#"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21430/arrow_right.jpg" alt="&lt;" width="42" height="54" border="0" /></a></div>
    </div>
    </div>
    <div class="row" style="margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <div class="slides-perfect-audio" style="width: 460px; display: block; overflow: hidden;">
    <div class="slidesjs-container" style="overflow: hidden; position: relative; width: 460px; height: 327.8723404255319px;">
    <div class="slidesjs-control" style="position: relative; left: 0px; width: 460px; height: 327.8723404255319px;">
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: 0px; width: 460px; z-index: 10; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/optical.png" alt="Optical input" /></div>
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: 460px; width: 460px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/RCA.png" alt="RCA input" /></div>
    <div class="slidesjs-slide" style="position: absolute; top: 0px; left: -460px; width: 460px; z-index: 0; display: block; -webkit-backface-visibility: hidden;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/aux_in.png" alt="Aux in" /></div>
    </div>
    </div>
    <ul class="slidesjs-pagination" style="margin: 10px auto; padding: 0px; display: block; width: 60px; list-style: none;">
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a class="active" style="color: #cccccc !important; text-decoration: none; cursor: pointer; padding: 0px; background-color: #999999; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; margin-right: 5px; display: inline-block; background-position: 100% 0%;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="0">1</a></li>
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a style="color: #ffffff; text-decoration: none; cursor: pointer; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; background-color: #ffffff; margin-right: 5px; display: inline-block;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="1">2</a></li>
    <li class="slidesjs-pagination-item" style="display: inline; list-style: none; margin: 0px; padding: 0px;"><a style="color: #ffffff; text-decoration: none; cursor: pointer; font-size: 1px; width: 8px; height: 8px; border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; border: 1px solid #999999; background-color: #ffffff; margin-right: 5px; display: inline-block;" href="http://au.creative.com/p/speakers/creative-t4-wireless#" data-slidesjs-item="2">3</a></li>
    </ul>
    </div>
    </div>
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;"><img style="border: 0px; vertical-align: middle;" src="http://d287ku8w5owj51.cloudfront.net/inline/products/21677/playing_games.jpg" alt="Switch to private listening" /></div>
    </div>
    <div class="row product-info-row" style="margin-bottom: 60px; margin-left: -20px; color: #666666; font-family: proxima-nova, Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 21px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <h4 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Even more connectivity options</h4>
    <p style="margin: 0px 0px 1em;">The Creative T4 Wireless comes with an optical input for digital signals, so you can directly send audio from sources such as your HD TV or sound cards without loss of resolution. It also has RCA analog inputs for connection to your video console or DVD player, as well as a 3.5mm input for connection to smart devices and portable media players.</p>
    </div>
    <div class="span6" style="float: left; min-height: 1px; margin-left: 20px; width: 460px;">
    <h4 class="feature-subheader" style="font-size: 1.7142em; margin: 30px 0px 0.8em; font-family: proxima-nova, Helvetica, Arial, sans-serif; line-height: 1.25em; color: #252525; font-weight: normal;">Switch to private listening</h4>
    <p style="margin: 0px 0px 1em;">For late-night gaming or movie-watching, there's no need to worry about waking up the household. The Creative T4 Wireless' Audio Control Pod is integrated with a dedicated headphone jack so that you can conveniently plug in your headphones when the need arises.</p>
    </div>
    </div>
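
For comparison, here is a rough sketch of what inlining could look like on the PHP side, assuming the stylesheet rules are already available as a map from class name to declarations. The `$rules` array and the sample markup are hypothetical; real stylesheets would have to be downloaded and parsed first, and the browser output above contains computed styles, which plain PHP cannot reproduce without a rendering engine. Libraries such as Emogrifier or CssToInlineStyles are aimed at applying full stylesheets and are probably a better fit than this hand-rolled loop.

```php
<?php
// Hypothetical rules for illustration; in practice they would be parsed
// from the stylesheets that the page links to.
$rules = [
    'feature-header'   => 'font-size: 2.2857em; text-transform: uppercase;',
    'product-info-row' => 'margin-bottom: 60px;',
];

// Hypothetical fragment standing in for the extracted div.
$html = '<div><h3 class="feature-header">Pair and connect in so many ways</h3>'
      . '<div class="row product-info-row" style="color: #666666;"></div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Visit every element that carries a class attribute.
foreach ($xpath->query('//*[@class]') as $node) {
    $parts = [];
    foreach (preg_split('/\s+/', trim($node->getAttribute('class'))) as $class) {
        if (isset($rules[$class])) {
            $parts[] = $rules[$class];
        }
    }
    if ($parts) {
        // Existing inline declarations go last so that they still win.
        $existing = $node->getAttribute('style');
        if ($existing !== '') {
            $parts[] = $existing;
        }
        $node->setAttribute('style', implode(' ', $parts));
    }
}

// loadHTML() wraps the fragment in <html><body>; serialize only the body's children.
$inlined = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $inlined .= $doc->saveHTML($child);
}
echo $inlined;
```

This only handles single-class selectors; descendant selectors, specificity, inheritance and media queries are out of scope for the sketch.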

1 answer:

Answer 0 (score: 0)

Why not use wget?

    wget \
         --recursive \
         --no-clobber \
         --page-requisites \
         --html-extension \
         --convert-links \
         --restrict-file-names=windows \
         --domains website.org \
         --no-parent \
             www.website.org/tutorials/html/

http://www.linuxjournal.com/content/downloading-entire-web-site-wget