Scrapy DEBUG: Crawled (200), but nothing is returned

Time: 2018-11-20 19:20:41

Tags: python web-scraping scrapy web-crawler scrapy-spider

I am working on a scraping project and trying to collect the endorsement link for every brand.

My code is as follows:

my code

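It returns nothing. However, if I put each brand's URL in start_urls, it works fine. Since I don't even know how many URLs there are, adding every one of them to start_urls by hand is not practical.

The log shows: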

log

Can anyone help? Thanks in advance!

1 Answer:

Answer 0 (score: 0)

Your restrict_xpaths expression looks incorrect.

You can use the allow parameter instead, which is simpler:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):

    name = 'celebrityendorsers.com'
    start_urls = ['https://celebrityendorsers.com/endorsement/']

    rules = (
        # allow is a regex matched against each link's URL; only matches are followed
        Rule(LinkExtractor(allow=r'/endorsements/'), callback='parse_url_contents'),
    )

    def parse_url_contents(self, response):
        pass

Here is the output log:

2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playtex-wipes/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/plenish-cleanse/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-date-by-sarah-beckham/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation-3/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playmg/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playsight/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-cloths/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-league-trading-cards/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-group/> (referer: https://celebrityendorsers.com/endorsement/)
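
To see why this pattern picks up the detail pages but not the listing page itself, here is a minimal sketch using only Python's re module. It assumes, as far as I understand LinkExtractor, that each allow pattern is applied as a regular-expression search against the absolute URL. The URLs are copied from this post; the variable names are just for illustration.

```python
import re

# The pattern passed to LinkExtractor in the answer above.
ALLOW = re.compile(r"/endorsements/")

# The start URL plus three detail pages taken from the output log.
urls = [
    "https://celebrityendorsers.com/endorsement/",
    "https://celebrityendorsers.com/endorsements/playtex-wipes/",
    "https://celebrityendorsers.com/endorsements/playstation/",
    "https://celebrityendorsers.com/endorsements/platinum-group/",
]

# Keep only the URLs in which the pattern occurs somewhere.
followed = [u for u in urls if ALLOW.search(u)]

for u in followed:
    print(u)
```

Note the trailing "s" in the pattern: the start URL ends in /endorsement/ and is not matched, which is why it only appears as the referer in the log.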

If you really want to use XPath, try removing the [*].

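The XPath you commented out looks correct, but the callback is wrong: you cannot use parse as a callback with CrawlSpider. As the Scrapy documentation warns, CrawlSpider uses the parse method itself to implement its rule-following logic, so overriding it stops the rules from working. Give the callback a different name.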