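I'm running into some trouble using XPath with Scrapy.
I'm looking at the links in a table. In the browser, inspecting the element shows the complete link. However, the Scrapy shell cuts off the end of the link.
Example link from the table: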
http://www.ashp.org/DrugShortages/Current/Bulletin.aspx?id=463
When inspecting the element:
<a href="/DrugShortages/Current/Bulletin.aspx?id=463">
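Extracting it in the Scrapy shell drops the 463.
Any ideas?
Here is the spider code. I haven't actually set it up to follow the links yet; I figured I would get everything working with the correct XPath syntax first.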
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from ashp.items import AshpItem

class MySpider(BaseSpider):
    name = "ashp"
    allowed_domains = ["ashp.org"]
    start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for titles in titles:
            title = titles.select("a/text()").extract()
            link = titles.select("a/@href").extract()
            print title, link
Answer 0 (score: 2)
I think your XPath is incorrect. Here is a spider that prints all the Bulletin links on the page:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "ashp"
    allowed_domains = ["ashp.org"]
    start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@id='Mid_3Col']/div/table/tr/td/a")
        for link in links:
            title = link.select("text()").extract()[0]
            link = link.select("@href").extract()[0]
            print title, link
Output:
Acetazolamide Injection /DrugShortages/Current/Bulletin.aspx?id=463
Acetylcysteine Inhalation Solution /DrugShortages/Current/Bulletin.aspx?id=932
Acyclovir Injection /DrugShortages/Current/Bulletin.aspx?id=467
Adenosine Injection /DrugShortages/Current/Bulletin.aspx?id=976
Alcohol Dehydrated Injection (Ethanol) /DrugShortages/Current/Bulletin.aspx?id=778
Allopurinol Injection /DrugShortages/Current/Bulletin.aspx?id=998
...
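
The hrefs printed above are relative, while the browser shows them resolved against the page address. If you want full URLs like the one in the question, something along these lines might help. This is only a sketch using Python 3's standard library rather than the Python 2 Scrapy code above; `PAGE_URL`, `absolutize` and `bulletin_id` are names made up for illustration, not part of Scrapy.

```python
from urllib.parse import urljoin, urlparse, parse_qs

# Page the links were scraped from (copied from start_urls above).
PAGE_URL = "http://ashp.org/menu/DrugShortages/CurrentShortages"


def absolutize(page_url, href):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(page_url, href)


def bulletin_id(url):
    """Return the value of the `id` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("id")
    return values[0] if values else None


if __name__ == "__main__":
    href = "/DrugShortages/Current/Bulletin.aspx?id=463"
    full = absolutize(PAGE_URL, href)
    print(full)
    print(bulletin_id(full))
```

Checking the result of `bulletin_id` is also a quick way to confirm that the query string survived extraction intact.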