Python: How do I append a string to a Scrapy list item?

Time: 2015-05-02 22:26:53

Tags: python list class parsing scrapy

I am scraping some URLs, but they are all missing the base part of the URL, so I want to add the "start_url" as the base of each scraped URL.

Spider class:

class MySpider(BaseSpider):
    name = "teslanews"
    allowed_domains = ["teslamotors.com"]
    start_urls = ["http://www.teslamotors.com/blog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        updates = hxs.xpath('//div[@class="blog-wrapper no-image"]')

        items = []
        for article in updates:
            item = TeslanewsItem()
            item["date"] =  article.xpath('./div/span/span/text()').extract()
            item["title"] = article.xpath('./h2/a/text()').extract()
            item["url"] = article.xpath('./h2/a/@href').extract()
            items.append(item)
        return items

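I can't simply set base = "http://www.teslamotors.com" and use

item["url"] = article.xpath('./h2/a/@href').extract() + base

because that adds the base to the end instead of the front and, since it runs inside the for loop, it is applied character by character, with every letter separated by a comma.

I am fairly new to Scrapy, so I don't know how to do this.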
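For illustration, the symptom described above can be reproduced in plain Python without Scrapy. This is only a sketch under assumptions: the `hrefs` list stands in for what `.extract()` returns (a list of strings), the href value is hypothetical, and the in-place `+=` is one way the letter-by-letter result can arise, not necessarily the exact statement that was run.

```python
# Stand-in for the result of .extract(): a list of strings, not one string.
hrefs = ["/blog/some-post"]  # hypothetical href value
base = "http://www.teslamotors.com"

# In-place += on a list treats the string as an iterable of characters,
# so every letter of the base becomes its own list element.
broken = list(hrefs)
broken += base

# Prepending the base to each element keeps one full URL per entry.
fixed = [base + href for href in hrefs]

print(broken[:4])
print(fixed)
```

The first list ends up holding the original href followed by one element per character of the base, which matches the comma-separated letters; the second holds a single complete URL.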

1 answer:

Answer 0: (score: 2)

from scrapy.spider import BaseSpider
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin


class MySpider(BaseSpider):
    name = "teslanews"
    allowed_domains = ["teslamotors.com"]

    base = "http://www.teslamotors.com/blog"

    start_urls = ["http://www.teslamotors.com/blog"]

    def parse(self, response):

        updates = response.xpath('//div[@class="blog-wrapper no-image"]')

        items = []
        for article in updates:
            item = TeslanewsItem()
            item["date"] = article.xpath('./div/span/span/text()').extract()
            item["title"] = article.xpath('./h2/a/text()').extract()
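            # extract() returns a list, so join it into one string before resolving it against the base
            item['url'] = urljoin(self.base, ''.join(article.xpath('./h2/a/@href').extract()))
            items.append(item)  # without this, the returned list stays empty

        return items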
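
The URL joining used above can be tried on its own, outside a spider. The following is a minimal sketch, assuming the hrefs on the page are site-relative paths; the helper name `absolutize` and the sample href values are hypothetical and only serve the illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolutize(base, hrefs):
    """Resolve each extracted href against the base URL."""
    return [urljoin(base, href) for href in hrefs]


base = "http://www.teslamotors.com/blog"

# A path starting with "/" replaces the whole path of the base.
rooted = absolutize(base, ["/blog/some-post"])

# A bare relative name replaces only the last path segment ("blog").
relative = absolutize(base, ["other-page"])

# An already absolute URL is returned unchanged.
absolute = absolutize(base, ["http://example.com/x"])

print(rooted, relative, absolute)
```

Because urljoin resolves the href the way a browser would, it puts the base in front rather than at the end, and it works on whole strings, so nothing gets split into single characters.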