I'm new to Scrapy, and I've been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I've run into a problem. Of the total number of school links reported as available, the page only shows 50 at a time, in no particular order.
However, if the page is reloaded, it shows a new list of 50 school links. Some of them differ from the ones shown before the refresh, while others stay the same.
What I want to do is add the links to a set(), and once len(set) reaches the total number of schools, send the set on to a parsing function.
There are two things I don't understand about solving this: where to place the set so that it keeps the links, and how to keep it from being reset every time parse() is called. Here is my current code:
import scrapy
import re

from icbse.items import IcbseItem


class IcbseSpider(scrapy.Spider):
    name = "icbse"
    allowed_domains = ["www.icbse.com"]
    start_urls = [
        "http://www.icbse.com/schools/",
    ]

    def parse(self, response):
        for i in xrange(20):  # I thought that if I iterated the start URL,
                              # I could probably have the page reload.
                              # It didn't work, though.
            for href in response.xpath(
                    '//div[@class="row"]/div[3]//span[@class="list-group-item"]'
                    '/a/@href').extract():
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # total number of schools found on page
        pages = response.xpath(
            "//div[@class='container']/strong/text()").extract()[0]

        self.captured_schools_set = set()  # Placing the set here doesn't work!

        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)

        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

    def reload_url(self, response):
        for school_href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(school_href))

    def scrape_school_info(self, response):
        item = IcbseItem()
        try:
            item["School_Name"] = response.xpath(
                '//td[@class="tfield"]/strong/text()').extract()[0]
        except:
            item["School_Name"] = ''
            pass
        try:
            item["streetAddress"] = response.xpath(
                '//td[@class="tfield"]')[1].xpath(
                "//span[@itemprop='streetAddress']/text()").extract()[0]
        except:
            item["streetAddress"] = ''
            pass
        yield item
Answer 0 (score: 2)
You are iterating over an empty set:
self.captured_schools_set = set()  # Placing the set here doesn't work!

while len(self.captured_schools_set) != int(pages):
    yield scrapy.Request(response.url, callback=self.reload_url)

for school in self.captured_schools_set:
    yield scrapy.Request(school, callback=self.scrape_school_info)
So the school requests are never fired.
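As a plain-Python analogy (my sketch, no Scrapy involved): nothing inside the while body can grow the set, because the callback that fills it only runs later, when the engine processes the yielded request, so the generator never reaches the for loop:

def parse_like(total=3):
    captured = set()
    while len(captured) != total:    # always true: the loop body never adds to the set
        yield "scrapy.Request(...)"  # the callback that fills the set would only run later
    for link in captured:            # never reached
        yield link

g = parse_like()
print(next(g))  # 'scrapy.Request(...)'
print(next(g))  # 'scrapy.Request(...)' -- and so on, forever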
To reload, you should fire the http://www.icbse.com/schools/ request with the dont_filter=True argument, because with the default settings Scrapy filters out duplicate requests.
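That is, something along these lines (a sketch; the URL and callback are taken from your code above):

yield scrapy.Request("http://www.icbse.com/schools/",
                     callback=self.reload_url,
                     dont_filter=True)  # bypass the duplicate-request filter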
But it seems you are not firing http://www.icbse.com/schools/ requests; you are firing "/state/name" requests instead (e.g. http://www.icbse.com/schools/state/andaman-nicobar). In the yield scrapy.Request(response.url, callback=self.reload_url) line above you are firing response.url, and that's the problem: change it to /schools/.
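Putting both points together, keeping the set as an instance attribute so it survives across callbacks, and reloading the listing page with dont_filter=True, a minimal, untested sketch could look like this (LIST_URL and total_schools are names I'm introducing for illustration; total_schools is assumed to be scraped and stored before the reload cycle starts):

import scrapy

LIST_URL = "http://www.icbse.com/schools/"  # per the point above: /schools/, not response.url


class IcbseSpider(scrapy.Spider):
    name = "icbse"
    allowed_domains = ["www.icbse.com"]
    start_urls = [LIST_URL]

    def __init__(self, *args, **kwargs):
        super(IcbseSpider, self).__init__(*args, **kwargs)
        self.captured_schools_set = set()  # lives on the spider, not inside a callback
        self.total_schools = 0             # assumed: set once the advertised count is scraped

    def reload_url(self, response):
        # Collect whichever 50 links this reload happened to show
        for school_href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(school_href))

        if len(self.captured_schools_set) < self.total_schools:
            # Still missing links: request the same page again;
            # dont_filter=True keeps the dupefilter from dropping it
            yield scrapy.Request(LIST_URL, callback=self.reload_url,
                                 dont_filter=True)
        else:
            # All links collected: now crawl each school page
            for school in self.captured_schools_set:
                yield scrapy.Request(school, callback=self.scrape_school_info)

    def scrape_school_info(self, response):
        # per-school parsing goes here, as in your original spider
        pass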