Hopefully you don't need the whole set of code here, but I'm having an issue: I'm parsing HTML with XPath and I'm not getting what I expect:
# here is the current set of tags I'm interested in
html = '''<div style="padding-top: 10px; clear: both; width: 100%;">
<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" ><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/communities/discussion_boards/comment-sm._CB192250344_.gif" width="16" alt="Comment" hspace="3" align="absmiddle" height="16" border="0" /></a> <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" >Comment</a> | <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm" >Permalink</a>'''
I'm trying to grab the href value of the first <a> tag, which is a long URL. To do this, I'm using the following code:
from lxml import etree
import StringIO
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(StringIO.StringIO(html), parser)
style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"
# use the XPath expression above to pull out the href value
tree.xpath(xpath)
['http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful']
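This works when I take out the portion I'm working with and paste it in as a string: it returns the full URL shown above. With a tree built from a requests session.get() call on the live page, which is otherwise identical, it returns ['http://www.amazon.com/review/R41M1I2K413NG'] instead. This is the code that returns the truncated value: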
from lxml import etree
import requests
import StringIO
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://www.amazon.com', HTTPAdapter(max_retries=retries))
parser = etree.HTMLParser(encoding="utf-8")
url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
page = session.get(url, timeout=5)
tree = etree.parse(StringIO.StringIO(page.text), parser)
style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"
# use the XPath expression above to pull out the href value
tree.xpath(xpath)
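I can't work out the cause. I know I'm shooting in the dark, but I'm hoping someone has run into this "XPath returns a truncated attribute value" problem before.
Edit:
The code above is the full code I'm currently using, and it doesn't work: it returns the truncated value.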
Edit 2:
This does work. Instead of creating a session object, using it to submit a get request, and then passing the result to the parser, simply passing the url string to the parser works:
url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
tree = etree.parse(url, parser)
for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"):
    print(e)
As I understand it, a session object persists connection attributes when looping over multiple URLs, which speeds up the process. If I use the etree.parse(url, parser) method instead, I fear I will lose that efficiency.
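The session can be kept while lxml still does the parsing, by handing the response body to the parser instead of the URL. Below is a minimal Python 3 sketch, assuming requests and lxml are installed. The helper names build_session, first_hrefs and fetch_first_hrefs are hypothetical, and nothing here establishes that this changes which hrefs the server sends back.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from lxml import etree

STYLE = "padding-top: 10px; clear: both; width: 100%;"
XPATH = "//div[@style='%s']/a[1]/@href" % STYLE


def build_session():
    """Create one Session so repeated requests reuse pooled connections."""
    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
    session.mount("http://www.amazon.com", HTTPAdapter(max_retries=retries))
    return session


def first_hrefs(body):
    """Return the href of the first <a> inside each matching <div>.

    body is the raw response bytes (page.content); the parser decodes
    them as UTF-8, mirroring the encoding used in the question.
    """
    root = etree.fromstring(body, etree.HTMLParser(encoding="utf-8"))
    return [str(href) for href in root.xpath(XPATH)]


def fetch_first_hrefs(session, url):
    """Fetch url through the shared session and extract the hrefs."""
    page = session.get(url, timeout=5)
    page.raise_for_status()
    return first_hrefs(page.content)
```

Calling fetch_first_hrefs with the same session object for every URL in a loop keeps the connection pooling that etree.parse(url, parser) would give up.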
Answer 0 (score: 0)
Using the URL you provided, the following Python code:
url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
from lxml import etree
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(url, parser)
for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"):
    print(e)
produces the following:
> python ~/test.py
http://www.amazon.com/review/RM8YYCQ57K2CL/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00J9PAZIO#wasThisHelpful
http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful
http://www.amazon.com/review/R3DT6VUDGIT9SK/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000VYD0MA#wasThisHelpful
http://www.amazon.com/review/RGFW1JM4151MW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TQQN5G0#wasThisHelpful
http://www.amazon.com/review/R3I9FFX0MVF1BW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0048A7NF8#wasThisHelpful
http://www.amazon.com/review/R24TTSQY34VME8/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0115ZHH68#wasThisHelpful
http://www.amazon.com/review/R3C49WWMNQZ007/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00ABAWHJ6#wasThisHelpful
http://www.amazon.com/review/R37724EHW829NB/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TO5Y3FK#wasThisHelpful
http://www.amazon.com/review/RQKGM5FRXVYSX/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0051QUWKG#wasThisHelpful
http://www.amazon.com/review/R1DW61PMGUDMDJ/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000N8Q2P6#wasThisHelpful
Running the sample code you provided (the session.get() version) instead produces:
http://www.amazon.com/review/RM8YYCQ57K2CL
http://www.amazon.com/review/R41M1I2K413NG
http://www.amazon.com/review/R3DT6VUDGIT9SK
http://www.amazon.com/review/RGFW1JM4151MW
http://www.amazon.com/review/R3I9FFX0MVF1BW
http://www.amazon.com/review/R24TTSQY34VME8
http://www.amazon.com/review/R3C49WWMNQZ007
http://www.amazon.com/review/R37724EHW829NB
http://www.amazon.com/review/RQKGM5FRXVYSX
http://www.amazon.com/review/R1DW61PMGUDMDJ
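This is because none of the URLs in the HTML page returned by session.get() carry any GET parameters: either the server does not return URLs with GET parameters in that case, or requests strips them.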
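To see exactly what differs between the two results, one of the long URLs can be split and compared with its short counterpart. The sketch below is Python 3 and uses only the standard library; url_parts is a hypothetical helper name, and the two URLs are copied from the outputs above.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into the pieces relevant to this comparison."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parts.query, "fragment": parts.fragment}


full = ("http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt/"
        "159-5911033-5890330?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful")
short = "http://www.amazon.com/review/R41M1I2K413NG"

for label, url in (("etree.parse(url)", full), ("session.get()", short)):
    print(label, url_parts(url))
```

The short URL is missing the trailing /ref=... path segment and the fragment as well as the query string, so more than the GET parameters differs. Checking whether the string ref=cm_aya_cmt appears anywhere in page.text would show whether the longer links were in the response at all.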