Get text from an lxml Comment

Time: 2014-10-20 19:11:06

Tags: python xpath web-scraping lxml lxml.html

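I'm trying to get the content of a _Comment. I've done a lot of research on how this is done, but I can't figure out how to access that function from the td element to get the text. In case it helps, I'm using XPaths with the Python Scrapy module.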

td = None [_Element]
    <built-in function Comment> = None [_Comment]
    a = None [_Element]

The HTML of the td element is:

<table class="crIFrameReviewList">

    <tr>
      <td>

<!-- BOUNDARY -->
<a name="R2L4AFEICL8GG6"></a><br />


<div style="margin-left:0.5em;">

      <div style="margin-bottom:0.5em;">
        304 of 309 people found the following review helpful
      </div>
      <div style="margin-bottom:0.5em;">
        <span style='margin-left: -5px;'><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif" width="64" alt="5.0 out of 5 stars" title="5.0 out of 5 stars" height="12" border="0" /> </span>
        <b>Great Travel Zoom</b>, <nobr>April 9, 2014</nobr>
      </div>
      <div style="margin-bottom:0.5em;">

      <div class="tiny" style="margin-bottom:0.5em;">
        <span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Verified Purchase</b><span class="tiny verifyWhatsThis">(<a href="http://www.amazon.com/gp/community-help/amazon-verified-purchase" target="AmazonHelp" onclick="amz_js_PopWin('http://www.amazon.com/gp/community-help/amazon-verified-purchase', 'AmazonHelp', 'width=400,height=500,resizable=1,scrollbars=1,toolbar=0,status=1');return false; ">What's this?</a>)</span></span>
      </div>
      <div class="tiny" style="margin-bottom:0.5em;">
        <b><span class="h3color tiny">This review is from: </span>Canon PowerShot SX700 HS Digital Camera (Black) (Electronics)</b>
      </div>

For the recent few years Canon has made great efforts to improve their travel-zoom compact cameras, and the new SX700 is their next remarkable achievement on that way. It's a little bit bigger than its predecessor (SX280) but it is very well built and has an attractive look and feel (I like the black one). It also got a new front grip which makes one-hand shooting more convenient, even when shooting video, since the Video button was moved from the back to the top and you can now use your thumb solely for holding the camera.<br /><br />Here is a brief list of the new camera pros & cons:<br /><br />PROS:<br />* A very good design and build quality with the attractive finish.<br />* A new powerful 30x optical zoom lens in just a pocket-size body.<br />* Incredible range from 25mm wide to 750mm telephoto for stills and video.<br />* Zoom Framing Assist - very useful new feature to compose your pictures at long telephoto.<br />* Very effective optical Intelligent Image Stabilization for...


<a href="http://www.amazon.com/Canon-PowerShot-SX700-Digital-Camera/product-reviews/B00I58M26Y" target="_top">Read more</a>
      <div style="padding-top: 10px; clear: both; width: 100%;">

1 answer:

Answer 0 (score: 1)

Find the div with class="reviewText" using the XPath expression .//div[@class="reviewText"], then dump the element to a string with the tostring() method, passing method="text":

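import lxml.html

data = """
your html here
"""

# Parse the HTML fragment; the result is the root element of the fragment.
td = lxml.html.fromstring(data)
# find() returns the first matching element, or None if nothing matches.
review = td.find('.//div[@class="reviewText"]')
# method="text" drops the markup; encoding="unicode" returns str instead of bytes.
print(lxml.html.tostring(review, method="text", encoding="unicode"))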
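
Since the question title asks about the _Comment node itself, here is a minimal sketch of reading a comment's text alongside the review text. The SAMPLE markup is a hypothetical, trimmed-down stand-in for the page in the question, and comment_texts and review_text are illustrative helper names, not part of lxml. It assumes the review sits inside a div with class="reviewText", as the answer above does.

```python
import lxml.html

# Hypothetical, trimmed-down stand-in for the HTML in the question.
SAMPLE = """
<table class="crIFrameReviewList">
  <tr>
    <td>
      <!-- BOUNDARY -->
      <a name="R2L4AFEICL8GG6"></a>
      <div class="reviewText">Great travel zoom.<br>Very well built.</div>
    </td>
  </tr>
</table>
"""


def comment_texts(root):
    """Return the text of every comment node below root."""
    # The XPath node test comment() selects comment nodes;
    # each one exposes its content through the .text attribute.
    return [c.text for c in root.xpath(".//comment()")]


def review_text(root):
    """Return the plain text of the first reviewText div, or None."""
    div = root.find('.//div[@class="reviewText"]')
    if div is None:
        return None
    # text_content() concatenates all text inside the element, without tags.
    return div.text_content().strip()


doc = lxml.html.fromstring(SAMPLE)
print(comment_texts(doc))
print(review_text(doc))
```

Checking for None before reading the text avoids an exception when the page layout differs from what the XPath expects, which is common when scraping.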