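So, I have a project that requires scraping data from class-central.com. I used SelectorGadget to get the CSS selectors for a few fields, but when I run the program, all I get is empty fields. I can't find the error in my code, please help! (PS: I'm a beginner and this is my first project, so please make sure your answer is understandable to someone like me.)
Open the following link to see the fields: https://www.classcentral.com/subject/cs
I can't scrape the following field:
Rating: the number of stars the course has been awarded and the number of reviews.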
import scrapy
from ..items import ClasscentralItem


class ClassCentral(scrapy.Spider):
    name = 'spidy'
    start_urls = [
        'https://www.classcentral.com/subject/cs'
    ]

    def parse(self, response):
        items = ClasscentralItem()
        all_tr = response.css('.xlarge-up-width-9-16')
        courses = response.css('.number-of-courses .text--bold::text')
        for x in all_tr:
            sub = response.css('.medium-up-head-1::text').extract()
            course_name = x.css('.course-name .text-- bold::text').extract()
            course_devloper = x.css('.uni-name::text').extract()
            via = x.css('.hover-initiativelinks , #course-listing-tbody .text--italic::text').extract()
            duration = x.css('.icon-clock-charcoal::text').extract()
            start_date = x.css('#course-listing-tbody .medium-only-hidden::text').extract()
            rating = x.css('.review-rating::text').extract()
            items['subjectname'] = sub
            items['course_name'] = course_name
            items['course_devloper'] = course_devloper
            items['via'] = via
            items['duration'] = duration
            items['start_date'] = start_date
            items['rating'] = rating
            yield items
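Answer 0 (score: 1)
Actually, your all_tr is just a list of all the Course Name columns, not a list of all the table rows. That is why you can't get start_date (which is in another column) from x.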
def parse(self, response):
    # Select whole table rows, so every column is reachable from x
    all_tr = response.css('#course-listing-tbody tr')
    courses = response.css('.number-of-courses .text--bold::text')
    for x in all_tr:
        # Build a fresh dict per row so yielded items do not share state
        items = {}
        sub = response.css('.medium-up-head-1::text').extract()
        course_name = x.css('.course-name .text--bold::text').get()
        course_devloper = x.css('.uni-name::text').extract()
        via = x.css('.text--italic::text').get()
        duration = x.css('.icon-clock-charcoal::text').extract()
        start_date = x.css('.medium-only-hidden::text').get()
        # Raises KeyError if the row has no such cell or attribute
        rating = x.css('td:nth-child(4)').attrib['data-timestamp']
        items['subjectname'] = sub
        items['course_name'] = course_name
        items['course_devloper'] = course_devloper
        items['via'] = via
        items['duration'] = duration
        items['start_date'] = start_date
        items['rating'] = rating
        yield items
Update: for rating, I grabbed the data-timestamp attribute of the fourth column (Rating).
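If you look at the page source, you'll see that some rows have no course details (they are ad rows). That is why you get an error after 5 results. To get all the courses, you need to modify the all_tr selector: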
all_tr = response.css('#course-listing-tbody tr[itemscope]')