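If I use the XPath id('div_a1')/x:div[3] to try to extract the ◎ 基本解释 subsection for the single character 匞 from this website, I get the error message:
ValueError: ("Invalid XPath: %s" % query)
If I simplify it to id('div_a1'), I extract too much, but I get no error.
The XPath value id('div_a1')/x:div[3] was generated by the Firefox add-on XPath Checker, which I have used before with great success.
What is wrong with the expression?
In case it matters, I am using Scrapy to build a web scraper that extracts this element. Here is what it looks like: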
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
#from hz_sample.items import HzSampleItem

class DmozSpider(BaseSpider):
    name = "hzIII"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for title in titles:
            # This is the expression that triggers the "Invalid XPath" error.
            tester = title.xpath("id('div_a1')/x:div[3]").extract()
            print tester
Chromium tells me the XPath is //*[@id="div_a1"]/div[3], which does not seem to work either.
Thanks for your consideration.
Answer 0 (score: 1)
'id("div_a1")/div[3]' works for me.
See this sample scrapy shell session:
$ scrapy shell http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml
...
2015-02-10 12:56:13+0100 [default] DEBUG: Crawled (200) <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbd4f5a7a50>
[s]   item       {}
[s]   request    <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml>
[s]   response   <200 http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml>
[s]   settings   <scrapy.settings.Settings object at 0x7fbd4f5a1fd0>
[s]   spider     <DefaultSpider 'default' at 0x7fbd4e8cd390>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
In [1]: response.xpath('id("div_a1")')
Out[1]: [<Selector xpath='id("div_a1")' data=u'<div id="div_a1" style="display:block ">'>]
In [2]: response.xpath('id("div_a1")/div')
Out[2]:
[<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><span class="zi18b">'>,
<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><span class="zi18b">'>,
<Selector xpath='id("div_a1")/div' data=u'<div class="content16">\r\n<span class="zi'>,
<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>,
<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>,
<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>]
In [3]: response.xpath('string(id("div_a1")/div[3])')
Out[3]: [<Selector xpath='string(id("div_a1")/div[3])' data=u'\r\n\u25ce \u57fa\u672c\u89e3\u91ca\r\n\u6bd6 b\xec \u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af'>]
In [4]: response.xpath('normalize-space(id("div_a1")/div[3])').extract()
Out[4]: [u'\u25ce \u57fa\u672c\u89e3\u91ca \u6bd6 b\xec \u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 \u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 \u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 \u7b14\u753b\u6570\uff1a9\uff1b \u90e8\u9996\uff1a\u6bd4\uff1b \u7b14\u987a\u7f16\u53f7\uff1a153545434']
In [5]: print response.xpath('normalize-space(id("div_a1")/div[3])').extract()[0]
◎ 基本解释 毖 bì 谨慎:惩前毖后(接受过去失败的教训,以后小心不重犯)。 操劳:“无毖于恤”。 古同“泌”,泉水冒出流淌的样子。 笔画数:9; 部首:比; 笔顺编号:153545434
In [6]:
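A likely reason the original expression fails is the x: in x:div. In XPath that is a namespace prefix (XPath Checker appears to emit it for the page's default namespace), and a query that uses a prefix with no registered mapping is rejected as invalid. The sketch below illustrates the same effect with Python's standard-library xml.etree.ElementTree, used here only as a stand-in for Scrapy's selectors. The SAMPLE markup and the helper names third_div_text and prefixed_query_error are hypothetical and are not taken from the real page. ElementTree supports only a subset of XPath, so //*[@id=...] replaces id(...) and the whitespace cleanup that normalize-space() performed in the session is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure (not the real markup).
SAMPLE = """
<html>
  <body>
    <div id="div_a1">
      <div class="text16">first</div>
      <div class="text16">second</div>
      <div class="content16">
        <span>◎ 基本解释</span>
        毖 bì
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)


def third_div_text(tree_root):
    """Return the whitespace-normalized text of the third child div of #div_a1."""
    node = tree_root.find(".//*[@id='div_a1']/div[3]")
    if node is None:
        return None
    # Join all text nodes, then collapse runs of whitespace into single
    # spaces, roughly what normalize-space() does inside XPath.
    return " ".join("".join(node.itertext()).split())


def prefixed_query_error(tree_root):
    """Run the same query with an unmapped 'x:' prefix and return the error text."""
    try:
        tree_root.find(".//*[@id='div_a1']/x:div[3]")
    except SyntaxError as exc:
        # ElementTree reports an unknown namespace prefix as a SyntaxError.
        return str(exc)
    return None


if __name__ == "__main__":
    print(third_div_text(root))
    print(prefixed_query_error(root))
```

In a Scrapy spider the equivalent change would be to drop the prefix and call response.xpath('id("div_a1")/div[3]') once on the response, as in the session above, rather than once per //p element.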