I am trying to scrape results from here with scrapy. The problem is that not all of the classes are shown on the page until the "Load more results" tab is clicked.
The problem can be seen here:
My code looks like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# ClasscentralItem is defined in this project's items.py


class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )

    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])
Answer (score: 1)
The second page of this website seems to be generated through an AJAX call. If you look at the Network tab of your browser's inspection tools, you will see something like this:
In this case the page appears to retrieve a json file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134
Now, the url parameter _=1469471093134 doesn't seem to do anything, so you can trim it down to: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
The returned json contains the html code of the next page:
# so you just need to load it up with
data = json.loads(response.body)
# and convert it to scrapy selector -
sel = Selector(text=data['table'])
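For a quick sanity check outside the spider, here is a minimal, self-contained sketch of that conversion. The sample JSON body below is made up; the only assumption is that the real AJAX response wraps the page's html under a 'table' key, as described above:

import json

from scrapy.selector import Selector

# made-up payload standing in for the real AJAX response body
sample_body = '{"table": "<div><span class=\\"course-name-text\\">Example Course</span></div>"}'

data = json.loads(sample_body)      # parse the JSON payload
sel = Selector(text=data['table'])  # wrap the embedded html in a Selector
print(sel.xpath('//span[@class="course-name-text"]/text()').extract())
# -> ['Example Course']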
To replicate this in your code, try something like:
import json

from scrapy import Request
from scrapy.selector import Selector
from w3lib.url import add_or_replace_parameter


def parse(self, response):
    # check if the response is json; if so, convert it to a selector
    if response.meta.get('is_json', False):
        # the json wraps the page html under the 'table' key
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)

    # parse the page for items
    x = sel.xpath('//span[@class="course-name-text"]/text()').extract()
    item = ClasscentralItem()
    for y in x:
        item['name'] = y
        print(item['name'])

    # do next page
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is a next page
        next_page = response.meta.get('page', 1) + 1
        # make the next page url from the AJAX endpoint found in the network tab
        url = 'https://www.class-central.com/maestro/courses/recentlyAdded'
        url = add_or_replace_parameter(url, 'page', next_page)
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
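For completeness, a rough sketch of how the pieces above could fit into a single spider: it reuses the start URL from the question and the AJAX endpoint from the network tab, swaps CrawlSpider for a plain scrapy.Spider (pagination is handled manually, so no link-extraction rules are needed), and yields a plain dict instead of printing so the items reach the usual scrapy pipelines. Treat it as a sketch, not the poster's exact code:

import json

import scrapy
from scrapy.selector import Selector
from w3lib.url import add_or_replace_parameter

# AJAX endpoint discovered in the browser's network tab
AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'


class ClassCentralSpider(scrapy.Spider):
    name = 'class_central'
    allowed_domains = ['www.class-central.com']
    start_urls = ['https://www.class-central.com/courses/recentlyAdded']

    def parse(self, response):
        if response.meta.get('is_json', False):
            # AJAX responses embed the next page's html under the 'table' key
            sel = Selector(text=json.loads(response.body)['table'])
        else:
            sel = Selector(response)

        for name in sel.xpath('//span[@class="course-name-text"]/text()').extract():
            yield {'name': name}  # or fill a ClasscentralItem here instead

        if sel.xpath("//div[@id='show-more-courses']"):
            next_page = response.meta.get('page', 1) + 1
            url = add_or_replace_parameter(AJAX_URL, 'page', next_page)
            yield scrapy.Request(url, self.parse,
                                 meta={'page': next_page, 'is_json': True})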