I'm scraping some URLs, but they are all missing the base of the URL, so I want to prepend the start_url as the base of each scraped URL.
The spider class:
class MySpider(BaseSpider):
    name = "teslanews"
    allowed_domains = ["teslamotors.com"]
    start_urls = ["http://www.teslamotors.com/blog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        updates = hxs.xpath('//div[@class="blog-wrapper no-image"]')
        items = []
        for article in updates:
            item = TeslanewsItem()
            item["date"] = article.xpath('./div/span/span/text()').extract()
            item["title"] = article.xpath('./h2/a/text()').extract()
            item["url"] = article.xpath('./h2/a/@href').extract()
            items.append(item)
        return items
I can't just use item["url"] = article.xpath('./h2/a/@href').extract() + base with base = "http://www.teslamotors.com", because that adds the base at the end, and since extract() returns a list rather than a string, the concatenation proceeds letter by letter inside the for loop, with every letter separated by commas.
I'm fairly new to Scrapy, so I don't know how to do this.
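The problem can be reproduced outside Scrapy: extract() returns a list of strings, so it has to be collapsed to a single string before joining it onto the base URL. A minimal sketch (using Python 3's urllib.parse.urljoin; in Python 2 the same function lives in the urlparse module, and the href value is a placeholder for illustration):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.teslamotors.com"
hrefs = ["/blog/tesla-model-s"]  # what extract() returns: a list, not a string

# Collapse the list to one string first, then resolve it against the base:
relative = "".join(hrefs)
full_url = urljoin(base, relative)
print(full_url)  # http://www.teslamotors.com/blog/tesla-model-s
```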
Answer 0 (score: 2):
from scrapy.spider import BaseSpider
from urlparse import urljoin


class MySpider(BaseSpider):
    name = "teslanews"
    allowed_domains = ["teslamotors.com"]
    base = "http://www.teslamotors.com"
    start_urls = ["http://www.teslamotors.com/blog"]

    def parse(self, response):
        updates = response.xpath('//div[@class="blog-wrapper no-image"]')
        items = []
        for article in updates:
            item = TeslanewsItem()
            item["date"] = article.xpath('./div/span/span/text()').extract()
            item["title"] = article.xpath('./h2/a/text()').extract()
            item["url"] = urljoin(self.base, ''.join(article.xpath('./h2/a/@href').extract()))
            items.append(item)
        return items
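One subtlety worth noting: even if the base were set to "http://www.teslamotors.com/blog", urljoin would still work here, because root-relative hrefs (those starting with "/") replace the base's path entirely; only hrefs without a leading slash resolve against the base's directory. A quick check (the href values are hypothetical):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.teslamotors.com/blog"

# A root-relative href replaces the base's path entirely:
print(urljoin(base, "/blog/tesla-model-s"))
# http://www.teslamotors.com/blog/tesla-model-s

# A relative href (no leading slash) resolves against the base's
# directory instead, which may not be what you expect:
print(urljoin(base, "tesla-model-s"))
# http://www.teslamotors.com/tesla-model-s
```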