I need to crawl multiple websites, and I only want to crawl a certain number of pages from each site. How can this be done?

My idea is to use a dict whose keys are the domain names and whose values are the number of pages already stored in MongoDB. So whenever a page is successfully crawled and stored in the database, the page count for that domain is incremented by one. Once the count exceeds the maximum, the spider should stop crawling that site.

Below is my code, but it does not work: the spider keeps crawling even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite.
class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw
        self.maximumPagesPerSite = 100  # maximum number of pages per site
        self.crawledPagesPerSite = {}

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item
class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
            if domain_name in spider.crawledPagesPerSite:
                spider.crawledPagesPerSite[domain_name] += 1
            else:
                spider.crawledPagesPerSite[domain_name] = 1
            if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                suffix = tldextract.extract(item['url']).suffix
                domain_and_suffix = domain_name + "." + suffix
                if domain_and_suffix in spider.allowed_domains:
                    spider.allowed_domains.remove(domain_and_suffix)
                    spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                    return None
        return item
Answer 0 (score: 0)
How about this:
def parse_url(self, response):
    url = response.url
    domain_name = tldextract.extract(url).domain
    if domain_name in self.crawledPagesPerSite:
        # If enough pages have been visited in this domain, return
        if self.crawledPagesPerSite[domain_name] > self.maximumPagesPerSite:
            return
        self.crawledPagesPerSite[domain_name] += 1
    else:
        self.crawledPagesPerSite[domain_name] = 1
    print self.crawledPagesPerSite[domain_name]
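To also produce the item while enforcing the limit, the same check can be folded into the parse_url from the question. This is only a rough sketch that reuses the question's AnExampleItem, parse_page helper, and counters, and it moves the counting from the pipeline into the spider:

def parse_url(self, response):
    domain_name = tldextract.extract(response.url).domain
    # Count this page against the domain's budget as soon as it is parsed.
    count = self.crawledPagesPerSite.get(domain_name, 0) + 1
    self.crawledPagesPerSite[domain_name] = count
    # Once the budget is spent, stop producing items for this domain.
    if count > self.maximumPagesPerSite:
        return None
    item = AnExampleItem()
    item["url"] = response.url
    item["extracted_text"] = parse_page.parse_page(response.body)
    return item

Note that this only stops new items from being produced; requests for that domain that are already scheduled will still be downloaded, since the CrawlSpider keeps following links matched by its rules.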
Answer 1 (score: 0)
I am not sure if this is what you are looking for, but I use this approach to crawl only a set number of pages. Say I only want to scrape the first 99 pages of example.com; I would handle it like this:
start_urls = ["https://example.com/page-%s.htm" % page for page in list(range(100))]
The crawl stops once page 99 has been reached, but this only works if your URLs contain the page number.
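If each of your sites exposes the same kind of numbered URL, the same trick can be extended to several domains at once. A sketch, where the domain list and the page-N.htm pattern are placeholders for illustration only:

domains = ["example.com", "example.org"]  # placeholder list of sites
max_pages = 99  # crawl at most this many numbered pages per site

start_urls = [
    "https://%s/page-%d.htm" % (domain, page)
    for domain in domains
    for page in range(1, max_pages + 1)
]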
Answer 2 (score: 0)
I am still a beginner with Scrapy myself, but I combined two answers from other StackOverflow posts to find a solution that works for me. Say you want to stop scraping after N pages; then you can import the CloseSpider exception as shown below:
# To import it:
from scrapy.exceptions import CloseSpider

# Later, to use it:
raise CloseSpider('message')
For example, you can integrate it into your parse callback to close the spider after N URLs:
N = 10     # Change 10 to however many pages you want.
count = 0  # The count starts at zero.

def parse(self, response):
    # Return if we have already parsed more than N
    if self.count >= self.N:
        raise CloseSpider(f"Scraped {self.N} items. Eject!")
    # Increment the count by one:
    self.count += 1
    # Put the rest of the parsing code here
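Put together as a minimal self-contained spider, the pattern might look roughly like the sketch below; the start URL, the item fields, and the link selector are placeholders rather than anything from the question:

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited_example"  # hypothetical spider name
    start_urls = ["https://example.com/"]
    N = 10     # stop after this many parsed pages
    count = 0

    def parse(self, response):
        # Shut the spider down once the global limit is hit.
        if self.count >= self.N:
            raise CloseSpider("Scraped %d pages. Eject!" % self.N)
        self.count += 1
        yield {"url": response.url}  # placeholder item
        # Keep following links; the check above caps how many pages get parsed.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Keep in mind that CloseSpider shuts down the whole spider, so this gives a global cap across all sites rather than the per-site cap the question asks for, and requests already in flight may still finish before the spider actually closes.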
Links to the original posts I found: