我正在研究一个抓取项目,并尝试获取乐队的每个认可链接。
我的代码如下:
它什么也没返回。但是,如果我将波段的每个URL放在start_url
中,则效果很好。但是由于我什至不确定有多少个URL,所以很难手动将所有想要的URL放入start_url
字段中。
显示日志:
任何人都可以帮忙吗?预先感谢!
答案 0 :(得分:0)
您的限制性xpath表达式看起来不正确。
您可以改用allow
参数,这很容易:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'celebrityendorsers.com'
start_urls = ['https://celebrityendorsers.com/endorsement/']
rules = (
Rule(LinkExtractor('/endorsements/'), callback='parse_url_contents'),
)
def parse_url_contents(self, response):
pass
这是输出日志:
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playtex-wipes/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/plenish-cleanse/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-date-by-sarah-beckham/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation-3/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playmg/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playsight/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-cloths/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-league-trading-cards/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-group/> (referer: https://celebrityendorsers.com/endorsement/)
如果您确实要使用xpath,请尝试删除[*]
。
您注释的xpath看起来正确,但是回调是错误的,您不能将parse
回调与CrawlSpider
一起使用。