Question

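I am currently trying to create a simple crawler in Python using Scrapy. What I want it to do is read a list of links and save the HTML of the pages they point to. Right now I can get all the URLs, but I can't figure out how to download the pages. Here is the code for my spider so far: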

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BookItem

# Book scrappy spider

class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://www.learnpythonthehardway.org/book/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        file = open(filename,'wb')
        file.write(response.body)
        file.close()

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items

Answer 1

In your parse method, include Request objects in the list of items you return to trigger the downloads:

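from scrapy.http import Request  # needed at the top of the spider module

for site in sites:
    ...
    items.append(item)
    # callback is an argument of Request, not of items.append();
    # item['link'] must be a single absolute URL string at this point
    items.append(Request(item['link'], callback=self.parse))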

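This makes the crawler produce a BookItem for each link, and also recurse into each book page and download it. Of course, you can specify a different callback (e.g. self.parsebook) if you want to parse the subpages differently.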
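
One detail the snippet above glosses over: in the question's spider, item['link'] comes from .extract(), so it is a list of href strings that may be relative, while Request expects one absolute URL. The question's filename logic (response.url.split("/")[-2]) also only works for URLs that end in a slash. Here is a minimal, standard-library-only sketch of two helpers that could cover both cases. The names absolute_link and filename_for, the "index" default, and the "ex1.html" link are my own illustrative assumptions, not part of Scrapy or of the original code.

```python
from urllib.parse import urljoin, urlparse


def absolute_link(page_url, hrefs):
    """Return the first extracted href as an absolute URL, or None if there is none.

    `hrefs` is the list that .extract() returns; it may be empty, and its
    entries may be relative to `page_url`.
    """
    if not hrefs:
        return None
    return urljoin(page_url, hrefs[0].strip())


def filename_for(url, default="index"):
    """Derive a local .html filename from the last non-empty path segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    name = segments[-1] if segments else default
    return name if name.endswith(".html") else name + ".html"


if __name__ == "__main__":
    # Hypothetical link, used only to show the helpers in action.
    base = "http://www.learnpythonthehardway.org/book/"
    print(absolute_link(base, ["ex1.html"]))
    print(filename_for(base))
```

Inside the loop you would then build the request along the lines of Request(absolute_link(response.url, item['link']), callback=self.parsebook), skipping list entries for which the helper returns None, and the callback could open filename_for(response.url) in 'wb' mode and write response.body to it.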
