How can I use Scrapy to crawl only a limited number of pages from each website?

Date: 2015-12-09 11:04:17

Tags: python scrapy

I need to crawl multiple websites, and I only want to crawl a certain number of pages from each site. How can I achieve this?

My idea is to use a dictionary whose keys are domain names and whose values are the number of pages from that domain already stored in MongoDB. When a page is successfully crawled and stored in the database, the count for its domain is incremented by one. Once the count exceeds the maximum, the spider should stop crawling that site.

Here is my code, but it does not work: the spider keeps crawling even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite.

import parse_page  # the project's own helper module for extracting text from a page

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# AnExampleItem is the project's Item class (defined in items.py).

class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw

        self.maximumPagesPerSite = 100  # maximum pages to keep per site
        self.crawledPagesPerSite = {}   # pages stored so far, keyed by domain

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item

import pymongo
import tldextract
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
            if valid:
                self.collection.insert(dict(item))
                log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
                # Count the stored page against this domain's limit.
                if domain_name in spider.crawledPagesPerSite:
                    spider.crawledPagesPerSite[domain_name] += 1
                else:
                    spider.crawledPagesPerSite[domain_name] = 1
                # Once over the limit, try to stop crawling this domain.
                if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                    suffix = tldextract.extract(item['url']).suffix
                    domain_and_suffix = domain_name + "." + suffix

                    if domain_and_suffix in spider.allowed_domains:
                        spider.allowed_domains.remove(domain_and_suffix)
                        spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                        return None
                return item

3 Answers:

Answer 0 (score: 0)

How about this:

def parse_url(self, response):
    url = response.url
    domain_name = tldextract.extract(url).domain
    if domain_name in self.crawledPagesPerSite:
        # If enough pages have been visited in this domain, return.
        if self.crawledPagesPerSite[domain_name] > self.maximumPagesPerSite:
            return
        self.crawledPagesPerSite[domain_name] += 1
    else:
        self.crawledPagesPerSite[domain_name] = 1
    print(self.crawledPagesPerSite[domain_name])
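For context, a minimal sketch of how that check could slot back into the question's parse_url is shown below; it assumes the same AnExampleItem, parse_page, crawledPagesPerSite, and maximumPagesPerSite from the question and simply skips item extraction once a domain has used up its budget:

import tldextract

def parse_url(self, response):
    domain_name = tldextract.extract(response.url).domain

    # Count this page against the domain's budget; skip it once the budget is spent.
    count = self.crawledPagesPerSite.get(domain_name, 0)
    if count >= self.maximumPagesPerSite:
        return
    self.crawledPagesPerSite[domain_name] = count + 1

    # Build the item exactly as in the question.
    item = AnExampleItem()
    item["url"] = response.url
    item["extracted_text"] = parse_page.parse_page(response.body)
    return item

Note that this only stops items from being produced; with follow=True in the rules, the spider will still request further pages from that domain unless those requests are filtered as well.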

Answer 1 (score: 0)

I'm not sure whether this is what you are looking for, but I use this approach to crawl only a fixed number of pages. Suppose I only want to scrape the first 99 pages of example.com; I would handle it like this:

start_urls = ["https://example.com/page-%s.htm" % page for page in list(range(100))]

The crawl stops once page 99 has been reached. However, this only works if your URLs contain the page number.
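Just as a sketch: if every site in the question's url_file happened to use the same /page-N.htm pattern (an assumption, not something the question states), the same trick could be applied per domain:

MAX_PAGES = 100  # assumed per-site page limit

with open("urls.txt") as url_file:  # one domain per line, as in the question
    domains = [line.strip() for line in url_file if line.strip()]

start_urls = ["http://%s/page-%s.htm" % (domain, page)
              for domain in domains
              for page in range(MAX_PAGES)]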

Answer 2 (score: 0)

I am still a Scrapy beginner myself, but I combined two answers from other Stack Overflow posts into a solution that works for me. Say you want to stop crawling after N pages; you can then import the CloseSpider exception as follows:

# To import it:
from scrapy.exceptions import CloseSpider

# Later, to use it:
raise CloseSpider('message')

For example, you can integrate it into your parser to close the spider after N URLs:

N = 10     # Change 10 to however many pages you want.
count = 0  # The count starts at zero.

def parse(self, response):
    # Stop if more than N pages have been scraped.
    if self.count >= self.N:
        raise CloseSpider(f"Scraped {self.N} items. Eject!")
    # Increment the count by one.
    self.count += 1

    # Put the rest of the parsing code here.
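For completeness, a minimal self-contained spider built around this idea might look like the sketch below; the spider name, start URL, item fields, and link-following logic are placeholders rather than anything from the original post:

import scrapy
from scrapy.exceptions import CloseSpider


class LimitedSpider(scrapy.Spider):
    name = "limited"                     # placeholder name
    start_urls = ["http://example.com"]  # placeholder start URL
    N = 10                               # stop after this many pages
    count = 0

    def parse(self, response):
        # Close the whole spider once N pages have been scraped.
        if self.count >= self.N:
            raise CloseSpider(f"Scraped {self.N} items. Eject!")
        self.count += 1

        yield {"url": response.url}      # placeholder item

        # Keep following links until the limit is reached.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Keep in mind that CloseSpider shuts down the entire spider, so with several sites in one spider (as in the question) you would still need a per-domain counter like the one in Answer 0.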

Links to the original posts I found:

  1. Force spider to stop in scrapy
  2. Scrapy: How to limit number of urls scraped in SitemapSpider