Recursively downloading content with Scrapy

Asked: 2013-07-07 05:47:58

Tags: python python-2.7 web-scraping scrapy

After banging my head against this a few times, I've finally ended up here.

Problem: I am trying to download the content of each craigslist posting. By "content" I mean the posting body, such as the description of a cell phone. I searched for new and old phones, since the iPhone ended all the excitement.

The code builds on the excellent work of Michael Herman.

My Spider Class

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import *
from craig.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self,response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
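
As a quick sanity check, the `allow` regex in the rule above only matches the paginated listing URLs (the example URLs below are assumptions for illustration):

import re

# The rule's allow pattern: matches pagination pages like index100.html.
pattern = r"index\d00\.html"

print(bool(re.search(pattern, "http://minneapolis.craigslist.org/moa/index100.html")))  # True
print(bool(re.search(pattern, "http://minneapolis.craigslist.org/moa/index.html")))     # False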

And the Item class:

from scrapy.item import Item, Field

class CraiglistSampleItem(Item):
    title = Field()
    link = Field()

Since the code will traverse many links, I would like to save each phone's description in a separate csv, but an extra column in the csv would also be fine.
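
For reference, Scrapy's feed exports can already write items to csv (e.g. `scrapy crawl craigs -o items.csv -t csv` in Scrapy of that era); a plain-Python sketch of the extra-column layout, using made-up items, could look like:

import csv

# Hypothetical scraped items; the field values are assumptions for illustration.
items = [
    {"title": "iPhone 4", "link": "/moa/123.html", "description": "Lightly used"},
    {"title": "Nokia 3310", "link": "/moa/456.html", "description": "Indestructible"},
]

# One row per posting, with the description as an extra column.
with open("postings.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link", "description"])
    writer.writeheader()
    writer.writerows(items)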

Any leads!!!

1 Answer:

Answer 0 (score: 5)

Instead of returning the items in the parse_items method, you should return/yield scrapy Request instances in order to get the description from each item page: pass the link and title inside an Item, and the Item inside the meta dictionary:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
from scrapy.item import Item, Field


class CraiglistSampleItem(Item):
    title = Field()
    link = Field()
    description = Field()


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()[0]
            item["link"] = title.select("a/@href").extract()[0]

            url = "http://minneapolis.craigslist.org%s" % item["link"]
            yield Request(url=url, meta={'item': item},
                          callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

Run it and look at the additional description column in the output csv file.
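
The key move here, carrying a half-filled item to a second callback via `meta`, can be sketched without Scrapy (all names and values below are made up for illustration):

# Minimal sketch of the "pass state via meta" pattern: each "request" carries
# a meta dict that the next callback reads back.
def parse_list(postings):
    # First callback: build a partial item per posting, then "request"
    # the detail page, stashing the item in meta.
    requests = []
    for title, link in postings:
        item = {"title": title, "link": link}
        requests.append({"url": "http://example.org" + link,
                         "meta": {"item": item},
                         "callback": parse_detail})
    return requests

def parse_detail(response):
    # Second callback: recover the partial item from meta and finish it.
    item = response["meta"]["item"]
    item["description"] = response["body"]
    return item

# Simulate the two-step crawl for one posting.
reqs = parse_list([("iPhone 4", "/moa/123.html")])
fake_response = {"meta": reqs[0]["meta"], "body": "Lightly used iPhone"}
item = parse_detail(fake_response)
print(item["description"])  # Lightly used iPhone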

Hope that helps.