Question

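This is my first time doing web scraping, and I'm not sure I'm going about it the right way. The problem is that I want to crawl and scrape data at the same time: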

Get all the links I want to scrape
Store them in MongoDB
Visit them one by one to get their content

# Crawling: get all links to be scraped later on
from scrapy import Request, Selector, Spider

# Link is a scrapy Item with a 'url' field, defined elsewhere in the project (not shown).

class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in range(0, 10000, 20)]

    def parse(self, response):
        # loop for all pages
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()

        if not not next_page:
            yield Request("https://"+next_page[0], callback = self.parse)

        # loop for all links in a single page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')

        items = []  # collects the Link items found on this page
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)

        for item in items:
            yield item

# Scraping: get all the stored links on MongoDB and scrape them????

Answer 1

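What exactly is your use case? Are you mainly interested in the links themselves, or in the content of the pages they lead to? In other words, is there any reason to store the links in MongoDB first and scrape the pages later? If you really do need to store the links in MongoDB, the best way is to use an item pipeline to store the items. The Scrapy documentation on item pipelines even includes an example of storing items in MongoDB. If you need something more sophisticated, have a look at the scrapy-mongodb package.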
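As a rough illustration of the item pipeline approach, here is a minimal sketch rather than a definitive implementation. The class name MongoLinkPipeline, the setting names MONGO_URI and MONGO_DATABASE, their default values, and the "links" collection name are all assumptions made for this example. It also assumes a pymongo version that provides update_one, and the classic process_item(item, spider) pipeline signature.

```python
class MongoLinkPipeline:
    """Stores each scraped item in a MongoDB collection, keyed by its URL."""

    def __init__(self, mongo_uri, mongo_db, collection_name="links"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # assumed collection name

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names,
        # to be defined in the project's settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scraping"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Upsert on the URL so that re-crawling does not create duplicates.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item
```

To enable it, the pipeline would be registered in settings.py, for example ITEM_PIPELINES = {"myproject.pipelines.MongoLinkPipeline": 300}, where the module path is hypothetical and depends on your project layout.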

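Apart from that, a few comments on the actual code you posted:

Instead of Selector(response).xpath(...), just use response.xpath(...).
If you only need the first extracted element from a selector, use extract_first() instead of calling extract() and indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed; yield each item inside the loop over links.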
