So my spider takes a list of websites, and it crawls each of them by yielding a request that passes the item along in meta.

The spider then explores all the internal links of a single website and collects every external link into the item. The problem is that I don't know when the spider has finished crawling all the internal links, so I can't yield the item.

import csv
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

from myproject.items import Links  # adjust to wherever your Links item is defined


class WebsiteSpider(scrapy.Spider):
name = "web"
def start_requests(self):
filename = "websites.csv"
requests = []
try:
with open(filename, 'r') as csv_file:
reader = csv.reader(csv_file)
header = next(reader)
for row in reader:
seed_url = row[1].strip()
item = Links(base_url=seed_url, on_list=[])
request = Request(seed_url, callback=self.parse_seed)
request.meta['item'] = item
requests.append(request)
return requests
except IOError:
raise scrapy.exceptions.CloseSpider("A list of websites are needed")
def parse_seed(self, response):
item = response.meta['item']
netloc = urlparse(item['base_url']).netloc
external_le = LinkExtractor(deny_domains=netloc)
external_links = external_le.extract_links(response)
for external_link in external_links:
item['on_list'].append(external_link)
internal_le = LinkExtractor(allow_domains=netloc)
internal_links = internal_le.extract_links(response)
for internal_link in internal_links:
request = Request(internal_link, callback=self.parse_seed)
request.meta['item'] = item
yield request
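For reference, here is a minimal sketch of the Links item the code above assumes (only the field names base_url and on_list come from the code; everything else is a guess), together with the assumption that websites.csv has a header row and the URL in its second column:

import scrapy

class Links(scrapy.Item):
    base_url = scrapy.Field()  # seed URL read from the second column of websites.csv
    on_list = scrapy.Field()   # external links collected while crawling the seed site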
Answer 0 (score: 0)
The start_requests method needs to yield Request objects. You don't have to return a list of requests; just yield each request as soon as it is ready, because Scrapy requests are asynchronous.

The same goes for items: yield an item whenever you consider it ready. Here I would suggest checking whether there are any internal_links left to decide when to yield the item, or you could yield as many items as you want and later check which one is the last (or the one with the most data):
# imports are the same as in the original spider above
class WebsiteSpider(scrapy.Spider):
    name = "web"

    def start_requests(self):
        filename = "websites.csv"
        try:
            with open(filename, 'r') as csv_file:
                reader = csv.reader(csv_file)
                header = next(reader)  # skip the header row
                for row in reader:
                    seed_url = row[1].strip()
                    item = Links(base_url=seed_url, on_list=[])
                    # yield each request as soon as it is ready instead of building a list
                    yield Request(seed_url, callback=self.parse_seed, meta={'item': item})
        except IOError:
            raise scrapy.exceptions.CloseSpider("A list of websites are needed")

    def parse_seed(self, response):
        item = response.meta['item']
        netloc = urlparse(item['base_url']).netloc
        external_le = LinkExtractor(deny_domains=netloc)
        external_links = external_le.extract_links(response)
        for external_link in external_links:
            item['on_list'].append(external_link)
        internal_le = LinkExtractor(allow_domains=netloc)
        internal_links = internal_le.extract_links(response)
        if internal_links:
            for internal_link in internal_links:
                request = Request(internal_link.url, callback=self.parse_seed)
                request.meta['item'] = item
                yield request
        else:
            # no internal links left on this page, so consider the item ready
            yield item
Another thing you could do is create an extension that hooks into the spider_closed signal and does whatever you need once you know the spider has finished.
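A minimal sketch of such an extension, assuming a placeholder module myproject/extensions.py (the module and class names are mine; spider_closed is Scrapy's built-in signal):

# myproject/extensions.py
from scrapy import signals

class SpiderClosedExtension:

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # connect our handler to Scrapy's built-in spider_closed signal
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # called exactly once, after every scheduled request has been processed
        spider.logger.info("Spider %s finished (%s)", spider.name, reason)

Enable it in settings.py, for example with:

EXTENSIONS = {
    "myproject.extensions.SpiderClosedExtension": 500,
}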