XPath cuts off the href attribute

Date: 2013-09-08 14:22:46

Tags: python-2.7 xpath scrapy

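I'm having some trouble with XPath and Scrapy.

I'm looking at links in a table. In the browser, Inspect Element shows the full link. However, the Scrapy shell cuts off the end of the link.

Example link from the table: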

    http://www.ashp.org/DrugShortages/Current/Bulletin.aspx?id=463

When inspecting the element:

    <a href="/DrugShortages/Current/Bulletin.aspx?id=463">

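Extracting it in the Scrapy shell drops the 463.

Any ideas?

Here is the spider's code. It isn't actually set up to follow the links yet; I figured I would get all the XPath syntax right first.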

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from ashp.items import AshpItem

    class MySpider(BaseSpider):
        name = "ashp"
        allowed_domains = ["ashp.org"]
        start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//span[@class='pl']")
            for titles in titles:
                title = titles.select("a/text()").extract()
                link = titles.select("a/@href").extract()
                print title, link

1 Answer:

Answer 0 (score: 2)

I think your XPath is incorrect. Here is a spider that prints all the Bulletin links on the page:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class MySpider(BaseSpider):
        name = "ashp"
        allowed_domains = ["ashp.org"]
        start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Select every anchor inside the table cells under the Mid_3Col div.
            links = hxs.select("//div[@id='Mid_3Col']/div/table/tr/td/a")
            for link in links:
                # extract() returns a list; take the first match.
                title = link.select("text()").extract()[0]
                href = link.select("@href").extract()[0]
                print title, href

Output:

    Acetazolamide Injection /DrugShortages/Current/Bulletin.aspx?id=463
    Acetylcysteine Inhalation Solution /DrugShortages/Current/Bulletin.aspx?id=932
    Acyclovir Injection /DrugShortages/Current/Bulletin.aspx?id=467
    Adenosine Injection /DrugShortages/Current/Bulletin.aspx?id=976
    Alcohol Dehydrated Injection (Ethanol) /DrugShortages/Current/Bulletin.aspx?id=778
    Allopurinol Injection /DrugShortages/Current/Bulletin.aspx?id=998
    ...
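
Note that the extracted hrefs are site-relative, whereas the browser shows the full link. If you want absolute URLs, one option is to join each href with the URL of the page it came from. The following is only a minimal sketch using the standard library: `PAGE_URL` and the helper name `to_absolute` are illustrative assumptions, and inside a spider you would presumably pass `response.url` as the base instead.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Assumption: the page the links were scraped from (the spider's start URL).
PAGE_URL = "http://ashp.org/menu/DrugShortages/CurrentShortages"


def to_absolute(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Two of the relative hrefs from the output above.
hrefs = [
    "/DrugShortages/Current/Bulletin.aspx?id=463",
    "/DrugShortages/Current/Bulletin.aspx?id=932",
]

absolute_links = to_absolute(PAGE_URL, hrefs)
for url in absolute_links:
    print(url)
```

Because the hrefs start with `/`, they replace the whole path of the base URL, so the result keeps the host of the base (`ashp.org` here) together with the query string, including the `id` value.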