I'm learning to use Scrapy, and as an exercise I'm writing spiders to scrape different websites. In this example: https://www.thuisbezorgd.nl/eten-bestellen-castricum and https://www.iens.nl/restaurant+zoetermeer. There is something I don't understand; let's compare the two spiders:
Spider 1:
import scrapy
import urllib.parse as urlparse
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess
from iensScraper.items import IensscraperItem


class IensSpider(scrapy.Spider):
    name = "ienzz"
    start_urls = ['https://www.iens.nl/restaurant+zoetermeer']
    domain = ['https://www.iens.nl/']  # note: Scrapy itself only recognises allowed_domains

    def parse(self, response):
        # collect the fields of every restaurant on the current page
        restaurants = response.css('.resultItem')
        items = [(restaurant.css('[href]::text').extract_first(),
                  restaurant.css('.resultItem-address::text').extract_first(),
                  restaurant.css('.rating-ratingValue::text').extract_first(),
                  restaurant.css('.reviewsCount>[href]::text').extract_first())
                 for restaurant in restaurants]
        for item in items:
            holder = ItemLoader(item=IensscraperItem(), response=response)
            holder.add_value('naam', item[0])
            holder.add_value('adres', item[1])
            holder.add_value('rating', item[2])
            holder.add_value('recensies', item[3])
            yield holder.load_item()
        # follow the "next page" link, if there is one
        if response.xpath('//*[@class="next"]//@href').extract():
            link = response.css('.next>a::attr(href)').extract()
            yield Request(urlparse.urljoin(response.url, link[0]),
                          callback=self.parse, dont_filter=True)


process = CrawlerProcess()
process.crawl(IensSpider)
process.start()
Output of spider 1:
2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
' Middelwaard 86 2716 CW Zoetermeer\n'
' '],
'naam': ['Meerzicht']}
2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
' Burgemeester van Leeuwenpassage 2 2711 '
'JV Zoetermeer\n'
' '],
'naam': ['Brandcafé Zoetermeer']}
2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
' Van der Hagenstraat 22 2722 NT '
'Zoetermeer\n'
' '],
'naam': ['Taste of Asia']}
Spider 2:
import scrapy
import urllib.parse as urlparse
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess
from thuisbezorgdscraper.items import ThuisbezorgdscraperItem


class ThuisSpider(scrapy.Spider):
    name = 'spiderman'
    domain = ['https://www.thuisbezorgd.nl']  # note: Scrapy itself only recognises allowed_domains
    start_urls = ['https://www.thuisbezorgd.nl/eten-bestellen-castricum']

    def parse(self, response):
        # extract the delivery-area links and schedule a request for each
        raw_urls = response.css('.delarea')
        urls = raw_urls.css('::attr(href)').extract()
        for url in urls:
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # scrape every restaurant on an area page
        restaurants = response.css('.restaurant.grid')
        for restaurant in restaurants:
            l = ItemLoader(item=ThuisbezorgdscraperItem(), response=response)
            name = restaurant.css('.restaurantname[itemprop]::text').extract()
            address = restaurant.css('.restaurantaddress::text').extract()
            score = restaurant.css('.pointy::attr(title)').extract()
            reviews = restaurant.css('.nrofreviews::text').extract()
            l.add_value('address', address)
            l.add_value('name', name)
            l.add_value('score', score)
            l.add_value('reviews', reviews)
            yield l.load_item()


process = CrawlerProcess()
process.crawl(ThuisSpider)
process.start()
Output of spider 2:
2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-castricum-zuid-1901>
{'address': ['\n\t\t Rijksweg 2', '\t '],
'name': ['New York Pizza'],
'reviews': ['200 recensies'],
'score': ['Klantbeoordeling: 7 / 10']}
2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-castricum-noord-1902>
{'address': ['\n\t\t Heemskerkerweg 93', '\t '],
'name': ['Fresco'],
'reviews': ['1420 recensies'],
'score': ['Klantbeoordeling: 8 / 10']}
2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-centrum-1901>
{'address': ['\n\t\t Boulevard 13', '\t '],
'name': ['iSAFIA'],
'reviews': ['285 recensies'],
'score': ['Klantbeoordeling: 8 / 10']}
Spider 1: the first spider scrapes the 'iens' site exactly as I want. After parsing all the information on the start URL, it follows the pagination link to the next page, parses that page, and so on. The output confirms this: all restaurants on the initial page are returned first, then all restaurants on the second page, until no pages are left.
Spider 2: the second spider is structured slightly differently. It first extracts the URLs it needs from the start URL and then starts scraping those extracted URLs. I want this spider to behave like the first one: scrape all the restaurants from the first URL, then all the restaurants from the second URL, and so on until no URLs are left. Instead, the second spider scrapes restaurants from all the extracted URLs at the same time: it yields a restaurant from one extracted URL, then a restaurant from another, until no restaurants remain (a sequential variant would have to chain the requests itself, as in the sketch below).
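For comparison, here is a minimal sketch (not part of the original code) of how spider 2 could be forced to visit the area URLs one at a time: yield only the first request and carry the remaining URLs along in the request's meta, so the next URL is scheduled only after the current page has been fully parsed. SequentialThuisSpider, next_area and the 'pending' meta key are illustrative names of my own:

import scrapy
import urllib.parse as urlparse
from scrapy import Request


class SequentialThuisSpider(scrapy.Spider):
    name = 'spiderman_sequential'
    start_urls = ['https://www.thuisbezorgd.nl/eten-bestellen-castricum']

    def parse(self, response):
        urls = response.css('.delarea::attr(href)').extract()
        # schedule only the first URL; the rest travel along in meta
        yield from self.next_area(response, urls)

    def parse_item(self, response):
        for restaurant in response.css('.restaurant.grid'):
            yield {
                'name': restaurant.css('.restaurantname[itemprop]::text').extract(),
                'address': restaurant.css('.restaurantaddress::text').extract(),
            }
        # only after this page's items are out, move on to the next pending URL
        yield from self.next_area(response, response.meta['pending'])

    def next_area(self, response, pending):
        if pending:
            url, rest = pending[0], pending[1:]
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item,
                          meta={'pending': rest},
                          dont_filter=True)

Because each next Request is yielded only after all of a page's items, every area's restaurants would come out as one contiguous block.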
Question: why do these spiders behave so differently? (The spiders use the same settings!)
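For what it's worth, request concurrency is configurable per spider. Assuming otherwise default settings, one quick experiment (a sketch only, not a recommendation) is to cap spider 2 at a single concurrent request via custom_settings, which removes the interleaving caused by parallel downloads:

class ThuisSpider(scrapy.Spider):
    name = 'spiderman'
    start_urls = ['https://www.thuisbezorgd.nl/eten-bestellen-castricum']
    # only one request is downloaded at a time, so responses (and the
    # items parsed from them) arrive one page after another
    custom_settings = {'CONCURRENT_REQUESTS': 1}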