Scrapy: crawling multiple domains via the reoccurring URLs of each domain

Asked: 2016-11-17 20:53:19

Tags: python python-2.7 scrapy scrapy-spider

I am trying to crawl a number of selected domains and take only the essential pages from each site. My approach is to crawl one page of a domain, collect a limited set of URLs from it, and then crawl those URLs looking for URLs that also occur on the first page. That way I try to eliminate every URL that does not reoccur (content URLs such as product pages, etc.). The reason I am asking for help is that my scrapy.Request is not executed more than once. This is what I have so far:

import urlparse

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class Finder(scrapy.Spider):
    name = "finder"
    start_urls = ['http://www.nu.nl/']
    uniqueDomainUrl = dict()
    maximumReoccurringPages = 5

    rules = (
        Rule(
            LinkExtractor(
                allow=('.nl', '.nu', '.info', '.net', '.com', '.org', '.info'),
                deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free', 'reddit',
                      'videos', 'youtube', 'google', 'doubleclick', 'microsoft', 'yahoo',
                      'bing', 'znet', 'stackexchang', 'twitter', 'wikipedia', 'creativecommons',
                      'mediawiki', 'wikidata'),
            ),
            process_request='parse',
            follow=True
        ),
    )

    def parse(self, response):
        self.logger.info('Entering URL: %s', response.url)
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname
        if currentDomain in self.uniqueDomainUrl:
            yield

        self.uniqueDomainUrl[currentDomain] = currentDomain

        item = ImportUrlList()
        response.meta['item'] = item

        # Reoccurring URLs
        item = self.findReoccurringUrls(response)
        list = item['list']

        self.logger.info('Output: %s', list)

        # Crawl reoccurring urls
        #for href in list:
        #    yield scrapy.Request(response.urljoin(href), callback=self.parse)

    def findReoccurringUrls(self, response):
        self.logger.info('Finding reoccurring URLs in: %s', response.url)

        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        item['list'] = urls
        response.meta['item'] = item

        # Get all URLs on each web page (limit 5 pages)
        i = 0
        for value in urls:
            i += 1
            if i > self.maximumReoccurringPages:
                break

            self.logger.info('Parse: %s', value)
            request = Request(value, callback=self.test, meta={'item': item})
            item = request.meta['item']

        return item

    def test(self, response):
        self.logger.info('Page title: %s', response.css('title').extract())
        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        item['list'] = set(item['list']) & set(urls)
        return item

    def findUrlsOnCurrentPage(self, response):
        newUrls = []
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname
        currentUrl = currentUrlParse.scheme + '://' + currentUrlParse.hostname

        for href in response.css('a::attr(href)').extract():
            newUrl = urlparse.urljoin(currentUrl, href)

            urlParse = urlparse.urlparse(newUrl)
            domain = urlParse.hostname

            if href.startswith('#'):
                continue

            if domain != currentDomain:
                continue

            if newUrl not in newUrls:
                newUrls.append(newUrl)

        return newUrls

It seems that only the first page is processed; the other Request() calls are never executed, as far as I can tell from what reaches the callbacks.

1 answer:

Answer 0 (score: 0)

What does ImportUrlList() do? Did you implement it?
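If it is missing, a minimal definition that would be consistent with how the question reads and writes item['list'] might look like this (an assumption; the question never shows the class, only the field name it accesses):

import scrapy

class ImportUrlList(scrapy.Item):
    # Single field holding the list of candidate / reoccurring URLs,
    # matching the item['list'] accesses in the spider.
    list = scrapy.Field()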

You also never actually issue a scrapy.Request from findReoccurringUrls:

request = scrapy.Request(value, callback=self.test, meta={'item':item})

def findReoccurringUrls(self, response):
    self.logger.info('Finding reoccurring URLs in: %s', response.url)

    item = response.meta['item']
    urls = self.findUrlsOnCurrentPage(response)
    item['list'] = urls
    response.meta['item'] = item

    # Get all URLs on each web page (limit 5 pages)
    i = 0
    for value in urls:
        i += 1
        if i > self.maximumReoccurringPages:
            break

        self.logger.info('Parse: %s', value)
        request = scrapy.Request(value, callback=self.test, meta={'item':item})
        item = request.meta['item']
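Even then, the request above is only constructed: Scrapy downloads only the requests that a callback yields (or returns). Below is a minimal sketch of how the spider could be restructured so the follow-up requests are actually scheduled, with the intersection moved into the callback. The names parse, test, findUrlsOnCurrentPage, uniqueDomainUrl, maximumReoccurringPages and ImportUrlList are taken from the question; the restructuring itself is an assumption about the intended logic.

    def parse(self, response):
        currentDomain = urlparse.urlparse(response.url).hostname
        if currentDomain in self.uniqueDomainUrl:
            return
        self.uniqueDomainUrl[currentDomain] = currentDomain

        # Candidate URLs found on the first page of this domain.
        urls = self.findUrlsOnCurrentPage(response)
        item = ImportUrlList()
        item['list'] = urls

        # Yield the follow-up requests so Scrapy actually schedules them;
        # each callback narrows item['list'] down to the URLs that reoccur.
        for value in urls[:self.maximumReoccurringPages]:
            yield scrapy.Request(value, callback=self.test, meta={'item': item})

    def test(self, response):
        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        # Keep only the URLs that also appear on this page.
        item['list'] = set(item['list']) & set(urls)
        yield item

Because every follow-up request carries the same item instance in its meta, each callback narrows item['list'] a bit further, so the last item yielded contains the URLs that reoccurred on all of the crawled pages.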