Question

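I need to get all the links from the pages of a website. However, the spider only seems to fetch pages from the domain specified in start_urls. Here is my spider: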

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mp3.items import *
import re

class Mp3Spider(CrawlSpider):
    name = "mp3"
    start_urls = ['http://mp3skull.com']
    # allowed_domains= ['mp3skull.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'mp3/\w+']), callback = 'parse_post',
        follow= True)
    ]

    def parse_post(self, response):
        item = PostItem()           
        item['url'] = response.url
        # response.url is a string, so test the whole URL rather than its first character
        if item['url'].endswith('.mp3'):
            return item

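I want to get the URLs that have an .mp3 extension, but those URLs are on different domains. One such URL is http://uhmp3.com/user-mp3-to/8-all-about-that-bass-by-meghan-trainor.mp3. What is the best way to get all the URLs within a domain?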

Answer 1

Your rule

Rule(SgmlLinkExtractor(allow=[r'mp3/\w+']), callback = 'parse_post', follow= True)

only allows links whose absolute URL contains 'mp3/' to be extracted. That is why you cannot extract links that point to other domains.
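To see why the off-site link is dropped, here is a minimal standard-library sketch. It assumes the allow patterns are searched for anywhere in the absolute URL; `is_allowed` is a hypothetical helper written for illustration, not part of Scrapy, and the first URL is made up.

```python
import re

# Hypothetical helper (not a Scrapy API): keep a link if any pattern
# is found anywhere in its absolute URL.
def is_allowed(url, patterns):
    return any(re.search(p, url) for p in patterns)

allow = [r'mp3/\w+']

# Made-up on-site URL that contains "mp3/" followed by word characters.
on_site = 'http://mp3skull.com/mp3/meghan_trainor.html'
# Off-site URL taken from the question; it never contains "mp3/".
off_site = 'http://uhmp3.com/user-mp3-to/8-all-about-that-bass-by-meghan-trainor.mp3'

print(is_allowed(on_site, allow))
print(is_allowed(off_site, allow))
```

The second URL contains "mp3." and "mp3-" but never "mp3/", so the pattern rejects it regardless of which domain it is on.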

You can add another regular expression to allow so that other links are included, for example:

Rule(SgmlLinkExtractor(allow=[r'mp3/\w+', r'\.mp3$']), callback = 'parse_post', follow= True)
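The extended pattern list can be checked against the URL from the question with a short standard-library sketch. It again assumes search-style matching on the absolute URL; `is_allowed` is a hypothetical helper and `not_mp3` is a made-up URL. It also shows why the dot is worth escaping: an unescaped `.` matches any character, so a URL that merely ends in "mp3" would be accepted as well.

```python
import re

# Hypothetical helper (not a Scrapy API): keep a link if any pattern
# is found anywhere in its absolute URL.
def is_allowed(url, patterns):
    return any(re.search(p, url) for p in patterns)

# URL from the question, ending in a real ".mp3" extension.
mp3_url = 'http://uhmp3.com/user-mp3-to/8-all-about-that-bass-by-meghan-trainor.mp3'
# Made-up URL that ends in "mp3" but has no ".mp3" extension.
not_mp3 = 'http://example.com/download/trackmp3'

loose = [r'mp3/\w+', r'.mp3$']    # "." matches any single character
strict = [r'mp3/\w+', r'\.mp3$']  # "\." matches a literal dot only

print(is_allowed(mp3_url, strict))
print(is_allowed(not_mp3, loose))
print(is_allowed(not_mp3, strict))
```

With the escaped dot, only URLs that end in a literal ".mp3" pass the second pattern.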
