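I have a spider that starts from four different start_urls and then goes on to crawl certain links inside them. They all share the same domain and structure; the only thing that changes between them is the query parameter. I use two rules: one to open and parse each link, and another to follow the pagination.
My problem is this: pagination produces a huge amount of data, so I don't want to crawl every link. I need to check a condition (the publication year) on each link that gets crawled. When that year differs from the one I want, the spider should skip all the remaining links belonging to that start_url and move on to the links generated by the second start_url. How can I do that? Here is my spider's code: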
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class articleSpider(CrawlSpider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
    ]

    rules = (
        # Rule 1: open and parse each article link
        Rule(
            LinkExtractor(
                restrict_xpaths="//div[@class='results-post']/article/a"
            ),
            callback='parse_item',
            follow=True,
        ),
        # Rule 2: follow the pagination
        Rule(
            LinkExtractor(
                restrict_xpaths="//section[@class='results-navi'][1]/div/div[@class='prevpageNav left']"
            )
        )
    )

    def parse_item(self, response):
        name = response.url.strip('/').split('/')[-1]
        date = response.xpath("//section/p/time/@datetime").get()[:4]
        if date == '2020':
            with open(f'./src/data/{name}.html', 'wb') as f:
                f.write(response.text.encode('utf8'))
        return
Thanks in advance for your help.
Answer 0 (score: 3)
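I don't know of a simple way to achieve this, but the (untested) code below may help you get started. The logic is as follows: only one start_url is requested at a time, and the remaining start_urls, the remaining item URLs and the next-page URL are passed along in the request meta, so each callback decides which request comes next.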
from scrapy import Spider, Request


class articleSpider(Spider):
    name = 'article'
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/search/?category=value1',
        'https://www.website.com/search/?category=value2',
        'https://www.website.com/search/?category=value3',
        'https://www.website.com/search/?category=value4',
    ]

    def start_requests(self):
        # work on a copy so the class-level list is not mutated
        start_urls = list(self.start_urls)
        # request only the first start_url; the rest travel in the meta
        start_url = start_urls.pop(0)
        meta = {'start_urls': start_urls}
        yield Request(start_url, callback=self.parse, meta=meta)

    def next_start_url(self, start_urls):
        # go to next start_url; yields nothing when none are left
        if start_urls:
            start_url = start_urls.pop(0)
            meta = {'start_urls': start_urls}
            yield Request(start_url, self.parse, meta=meta)

    def parse(self, response):
        start_urls = response.meta['start_urls']

        # get item-urls (the href of each result link, made absolute)
        item_urls = [
            response.urljoin(href)
            for href in response.xpath(
                '//div[@class="results-post"]/article/a/@href'
            ).getall()
        ]

        # get next page-url
        # (assumes the pagination div wraps an <a> element)
        next_page = response.xpath(
            '//section[@class="results-navi"][1]/div'
            '/div[@class="prevpageNav left"]//a/@href'
        ).get()
        if next_page:
            next_page = response.urljoin(next_page)

        if not item_urls:
            # empty results page, go to next start_url
            yield from self.next_start_url(start_urls)
            return

        # pass the remaining item-urls and next page in the meta
        item_url = item_urls.pop(0)
        meta = {
            'next_page': next_page,
            'item_urls': item_urls,
            'start_urls': start_urls
        }
        yield Request(item_url, self.parse_item, meta=meta)

    def parse_item(self, response):
        item_urls = response.meta['item_urls']
        next_page = response.meta['next_page']
        start_urls = response.meta['start_urls']

        name = response.url.strip('/').split('/')[-1]
        # fall back to '' when the page has no <time> element
        date = (response.xpath("//section/p/time/@datetime").get() or '')[:4]

        if date != '2020':
            # wrong year: skip the rest of this start_url
            yield from self.next_start_url(start_urls)
            return

        with open(f'./src/data/{name}.html', 'wb') as f:
            f.write(response.text.encode('utf8'))

        if item_urls:
            # still items left to process
            item_url = item_urls.pop(0)
            meta = {
                'next_page': next_page,
                'item_urls': item_urls,
                'start_urls': start_urls
            }
            yield Request(item_url, self.parse_item, meta=meta)
        elif next_page:
            # all items are done - we go to next page
            meta = {'start_urls': start_urls}
            yield Request(next_page, self.parse, meta=meta)
        else:
            # no pages left, go to next start_url
            yield from self.next_start_url(start_urls)
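
The branching in parse_item can be easier to follow, and to test without running a crawl, if the "what do I request next?" decision is pulled out into a plain function. The sketch below is only an illustration: next_step and its (kind, url) return value are hypothetical names that are not part of Scrapy, and it assumes the item and start URL lists are consumed from the front.

```python
def next_step(year_matches, item_urls, next_page, start_urls):
    """Decide which URL to request next (hypothetical helper, not a Scrapy API).

    Returns a (kind, url) tuple where kind is 'item', 'page' or 'start',
    or None when there is nothing left to crawl.
    The lists are modified in place, like the meta lists in the spider.
    """
    if year_matches:
        if item_urls:
            # still items left on the current results page
            return ('item', item_urls.pop(0))
        if next_page:
            # current page is exhausted, follow the pagination
            return ('page', next_page)
    # wrong year, or this start URL has no items and no pages left
    if start_urls:
        return ('start', start_urls.pop(0))
    return None
```

Inside the spider, each callback would map the returned kind to the matching callback (parse_item for 'item', parse for 'page' and 'start') and build the Request with the updated lists in its meta.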