Scrapy Python: loop to the next unscraped link

Date: 2016-07-18 02:33:36

Tags: python scrapy scrapy-spider

I'm trying to get my spider to go over a list and grab all the URLs it can find, scrape some data from each, and then return and continue with the next unscraped link. When I run the spider I can see that it returns to the start page, but then it tries to scrape the same page again and quits. Any code advice would be welcome; I'm quite new to Python.

import scrapy
import re
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/list"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]

            request = scrapy.Request(item['listurl'], callback=self.parseBasicListingInfo)
            yield request

    def parseBasicListingInfo(item, response):
        item = ListResidentialItem()
        item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
        return item

Clarification: I'm passing [0] so it only takes the first link in the list, but I want it to continue with the next unscraped link.

After running the spider

Output:

2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/robots.txt> (referer: None)
2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/list> (referer: None)
2016-07-18 12:11:21 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/link1> (referer: http://www.domain.com/list)
2016-07-18 12:11:21 [scrapy] DEBUG: Scraped from <200 http://www.domain.com/link1>
{'title': [u'\rlink1\r']}

2 answers:

Answer 0 (score: 1)

This should work fine. Change the domain and the xpath and see.

    import scrapy

    class ProdItems(scrapy.Item):
        listurl = scrapy.Field()
        title = scrapy.Field()

    class productionSpider(scrapy.Spider):
        name = "production"
        allowed_domains = ["domain.com"]
        start_urls = [
            "http://domain.com/list"
        ]

        def parse(self, response):
            # Extract every matching href instead of only the first one,
            # and schedule a request for each.
            for url in response.xpath('//a[@id="link101"]/@href').extract():
                item = ProdItems()
                item['listurl'] = url
                # Pass the partially filled item along in meta so the
                # callback can finish populating it.
                yield scrapy.Request(url, callback=self.parseBasicListingInfo, meta={'item': item})

        def parseBasicListingInfo(self, response):
            item = response.meta['item']
            item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
            yield item
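
For quick local testing you can also run this spider from a plain Python script instead of the scrapy crawl command. A minimal sketch, assuming the classes above are defined in the same file; the USER_AGENT value is only an illustration:

    from scrapy.crawler import CrawlerProcess

    # Run the spider in-process; assumes ProdItems and productionSpider
    # from the snippet above are defined in this same file.
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',  # illustrative setting, adjust as needed
    })
    process.crawl(productionSpider)
    process.start()  # blocks until the crawl finishes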

Answer 1 (score: 1)

This is what's causing your problem:

item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]

"//" means "from the beginning of the document", which means it scans from the very first tag and always finds the same first link. What you need to do is search relative to the beginning of the current tag with ".//", which means "from this tag onwards". Your current for loop is also visiting every tag in the document, which is unneeded. Try this:

def parse(self, response):
    for href in response.xpath('//a[@id="link101"]/@href').extract():
        item = ProductionItem()
        item['listurl'] = href

        yield scrapy.Request(href, callback=self.parseBasicListingInfo, meta={'item': item})

The xpath pulls the hrefs out of the links and returns them as a list that you can iterate over.
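
To see the difference between the two xpath forms in isolation, here is a small self-contained sketch; the HTML snippet and the link101 id are made up for illustration:

from scrapy.selector import Selector

# Demo of '//' vs './/' on an invented HTML snippet.
html = """
<html><body>
  <div><a id="link101" href="/first">first</a></div>
  <div><a id="link101" href="/second">second</a></div>
</body></html>
"""
sel = Selector(text=html)

for div in sel.xpath('//div'):
    # '//' restarts at the document root, so every iteration
    # finds the same first link:
    print(div.xpath('//a/@href').extract_first())   # /first both times
    # './/' is relative to the current <div>, so each iteration
    # finds that div's own link:
    print(div.xpath('.//a/@href').extract_first())  # /first, then /second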