I loaded this page in the Scrapy shell:
scrapy shell "http://goo.gl/VMNMuK"
and tried to select:
response.xpath("//div[@class='inline']")
However, it returns []. If I use Find in the Chrome inspector on the same page, "//div[@class='inline']" matches 3 elements. Is this a bug?
Answer 0 (score: 2)
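The inline divs on this page come after the closing </body></html> tags, so they fall outside the document that the selector sees: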
...
</body></html>
<script type="text/javascript">
var cpro_id="u2312677";
...
Here is something you can try:
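from scrapy.selector import Selector
# '</html>' is 7 characters long, so use len() instead of a hard-coded offset.
marker = '</html>'
# Python 2 session; on Python 3 slice response.text (Scrapy 1.1+) because response.body is bytes.
rest = response.body[response.body.find(marker) + len(marker):]
Selector(text=rest).xpath("//div[@class='inline']")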
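The slicing step can be tried without Scrapy. The following is only a minimal sketch, assuming a made-up sample string shaped like the page described above; the helper text_after_closing_html and the class InlineDivCounter are hypothetical names, and the standard-library HTMLParser stands in for the XPath query.

```python
from html.parser import HTMLParser


def text_after_closing_html(markup, marker="</html>"):
    """Return whatever follows the first closing </html> tag, or '' if absent."""
    pos = markup.find(marker)
    return markup[pos + len(marker):] if pos != -1 else ""


class InlineDivCounter(HTMLParser):
    """Counts <div class="inline"> start tags (a stdlib stand-in for the XPath)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "inline":
            self.count += 1


# Made-up sample: two inline divs placed after the closing </html> tag.
sample = (
    "<html><body><p>main</p></body></html>\n"
    '<script type="text/javascript">var x = 1;</script>\n'
    '<div class="inline"><span>a</span></div>\n'
    '<div class="inline"><span>b</span></div>\n'
)

rest = text_after_closing_html(sample)
counter = InlineDivCounter()
counter.feed(rest)
print(counter.count)
```

In a Scrapy session the same remainder string would be passed to Selector(text=rest) as in the snippet above.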
Answer 1 (score: 1)
You can also use html5lib to parse the response body, for example by working on an lxml document built with lxml.html.html5parser. In the sample scrapy shell session below, I had to pass namespaces for the XPath to match:
$ scrapy shell http://chuansong.me/n/2584954
2016-03-07 12:06:42 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-07 12:06:44 [scrapy] DEBUG: Crawled (200) <GET http://chuansong.me/n/2584954> (referer: None)
In [1]: response.xpath('//div[@class="inline"]')
Out[1]: []
In [2]: response.xpath('//*[@class="inline"]')
Out[2]: []
In [3]: response.xpath('//html')
Out[3]: [<Selector xpath='//html' data=u'<html lang="zh-CN">\n<head>\n<meta http-eq'>]
In [4]: from lxml.html import tostring, html5parser
In [5]: dochtml5 = html5parser.document_fromstring(response.body_as_unicode())
In [6]: type(dochtml5)
Out[6]: lxml.etree._Element
In [7]: dochtml5.xpath('//div[@class="inline"]')
Out[7]: []
In [8]: dochtml5.xpath('//html:div[@class="inline"]', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[8]:
[<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cfe3998>,
<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cf691b8>,
<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cf73680>]
In [9]: for div in dochtml5.xpath('//html:div[@class="inline"]', namespaces={"html": "http://www.w3.org/1999/xhtml"}):
   ...:     print tostring(div)
   ...:
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:span>新浪名博、畅销书作家王珣的原创自媒体,“芙蓉树下”的又一片新天地,愿你美丽优雅地走过全世界。</html:span>
</html:div>
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:img src="http://q.chuansong.me/beauties-4.jpg" alt="美人的底气 微信二维码" height="210px" width="210px"></html:img>
</html:div>
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:script src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" async=""></html:script>
<html:ins style="display:inline-block;width:210px;height:210px" data-ad-client="ca-pub-0996811467255783" class="adsbygoogle" data-ad-slot="2990020277"></html:ins>
<html:script>(adsbygoogle = window.adsbygoogle || []).push({});</html:script>
</html:div>
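The reason Out[7] is empty while Out[8] matches is that html5parser puts every element in the XHTML namespace, so an unprefixed div does not match. Here is a small sketch of that effect; it is an assumption-laden illustration that uses the standard-library xml.etree.ElementTree instead of lxml, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up document whose elements all live in the XHTML namespace,
# similar to what an HTML5 parser produces.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="inline"><span>a</span></div>'
    '<div class="other"/>'
    '<div class="inline"><span>b</span></div>'
    '</body></html>'
)

# An unprefixed name looks for elements in no namespace, so nothing matches.
unprefixed = doc.findall('.//div[@class="inline"]')

# Binding a prefix to the XHTML namespace makes the same query match.
prefixed = doc.findall('.//html:div[@class="inline"]', {"html": XHTML_NS})

print(len(unprefixed), len(prefixed))
```

The prefix name html is arbitrary; only the namespace URI it maps to has to agree with the one in the parsed tree.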