ValueError: ("Invalid XPath: %s" % query): XPath Checker generates incorrect code

Asked: 2015-02-10 11:37:09

Tags: python html xpath web-crawler scrapy

If I use id('div_a1')/x:div[3] to try to extract the single section ◎ 基本解释 ("basic explanation") from this website, I get the error message:

ValueError:("Invalid XPath: %s" % query)

If I reduce it to id('div_a1'), I get no error, although I then extract too much.

The XPath value id('div_a1')/x:div[3] was generated with the Firefox add-on XPath Checker, which I have used before with great success.

What is wrong with this expression?

In case it matters, I am using Scrapy to build a web crawler that extracts this element. This is what it looks like:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
#from hz_sample.items import HzSampleItem

class DmozSpider(BaseSpider):
    name = "hzIII"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")

        for title in titles:
            # The expression with the x: prefix is the one that triggers the ValueError.
            tester = title.xpath("id('div_a1')/x:div[3]").extract()
            print tester

Chromium tells me the XPath is //*[@id="div_a1"]/div[3], which does not seem to work either.

Thanks for your consideration.

1 Answer:

Answer 0 (score: 1)

'id("div_a1")/div[3]' works for me. See this sample scrapy shell session:

$ scrapy shell http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml
...
2015-02-10 12:56:13+0100 [default] DEBUG: Crawled (200) <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbd4f5a7a50>
[s]   item       {}
[s]   request    <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml>
[s]   response   <200 http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml>
[s]   settings   <scrapy.settings.Settings object at 0x7fbd4f5a1fd0>
[s]   spider     <DefaultSpider 'default' at 0x7fbd4e8cd390>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: response.xpath('id("div_a1")')
Out[1]: [<Selector xpath='id("div_a1")' data=u'<div id="div_a1" style="display:block ">'>]

In [2]: response.xpath('id("div_a1")/div')
Out[2]: 
[<Selector xpath='id("div_a1")/div' data=u'<div class="text16"><span class="zi18b">'>,
 <Selector xpath='id("div_a1")/div' data=u'<div class="text16"><span class="zi18b">'>,
 <Selector xpath='id("div_a1")/div' data=u'<div class="content16">\r\n<span class="zi'>,
 <Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>,
 <Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>,
 <Selector xpath='id("div_a1")/div' data=u'<div class="text16"><hr class="hr"><span'>]

In [3]: response.xpath('string(id("div_a1")/div[3])')
Out[3]: [<Selector xpath='string(id("div_a1")/div[3])' data=u'\r\n\u25ce \u57fa\u672c\u89e3\u91ca\r\n\u6bd6 b\xec \u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af'>]

In [4]: response.xpath('normalize-space(id("div_a1")/div[3])').extract()
Out[4]: [u'\u25ce \u57fa\u672c\u89e3\u91ca \u6bd6 b\xec \u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 \u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 \u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 \u7b14\u753b\u6570\uff1a9\uff1b \u90e8\u9996\uff1a\u6bd4\uff1b \u7b14\u987a\u7f16\u53f7\uff1a153545434']

In [5]: print response.xpath('normalize-space(id("div_a1")/div[3])').extract()[0]
◎ 基本解释 毖 bì 谨慎:惩前毖后(接受过去失败的教训,以后小心不重犯)。 操劳:“无毖于恤”。 古同“泌”,泉水冒出流淌的样子。 笔画数:9; 部首:比; 笔顺编号:153545434
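
For readers unfamiliar with the two XPath functions used above: string() returns the concatenated text of a node, and normalize-space() additionally trims that text and collapses runs of whitespace into single spaces, which is why Out[4] has no \r\n in it. The following is a rough pure-Python approximation, offered only as a sketch. The helper name normalize_space and the sample markup are invented for illustration, and the code targets Python 3 rather than the Python 2 used in the session.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the div shown in Out[3].
SAMPLE = '<div class="content16">\r\n<span>◎ 基本解释</span>\r\n毖 bì 谨慎\r\n</div>'


def normalize_space(element):
    """Approximate XPath normalize-space() for an element.

    Joins all descendant text (roughly what string() returns), trims both
    ends, and collapses whitespace runs to one space. Note that str.split()
    treats more characters as whitespace than XPath does, so this is only
    an approximation.
    """
    text = "".join(element.itertext())
    return " ".join(text.split())


div = ET.fromstring(SAMPLE)
print(normalize_space(div))
```

In a Scrapy spider the XPath function itself is the simpler choice; the helper only shows what it does to the text.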

In [6]:
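
A likely explanation for the original error, offered as an assumption rather than something confirmed in this thread: the x: in x:div[3] is a namespace prefix added by XPath Checker, and nothing in the spider binds that prefix to a namespace, so the expression cannot be compiled. Dropping the prefix, as in the session above, avoids the problem. The sketch below reproduces the same kind of failure using the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, purely to stay self-contained. ElementTree has no id() function, so .//*[@id='div_a1'] is used instead; it reports the problem as a SyntaxError rather than a ValueError; and the markup and the helper try_xpath are invented. The code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Invented markup with the same shape as the page: a div with id="div_a1"
# that has several child divs.
SAMPLE = """
<html><body>
  <div id="div_a1">
    <div class="text16">first</div>
    <div class="text16">second</div>
    <div class="content16">third</div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def try_xpath(expression):
    """Return the text of each match, or an error string if the
    expression cannot be compiled."""
    try:
        return [el.text for el in root.findall(expression)]
    except SyntaxError as exc:
        return "error: %s" % exc


# The "x:" prefix is not bound to any namespace, so this fails.
print(try_xpath(".//*[@id='div_a1']/x:div[3]"))

# Without the prefix, the third child div is selected.
print(try_xpath(".//*[@id='div_a1']/div[3]"))
```

The same reasoning would explain why id('div_a1') alone worked in the question: it contains no prefixed step.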