Question

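So, I have a project to scrape data from classcentral.com. I used SelectorGadget to get the CSS selectors for a few fields, but when I run the program all I get is an empty field. I can't identify the error in my code, so please help! (PS: I'm a newbie and this is my first project, so please make sure your answer is understandable to someone like me.)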

Open the following link to see the fields: https://www.classcentral.com/subject/cs

I am unable to scrape the following fields:

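Start date: the date on which the course starts.
via: the website that hosts the course (e.g., Coursera).
Rating: the number of stars the course has been given and the number of reviews.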

import scrapy
from ..items import ClasscentralItem

class ClassCentral(scrapy.Spider):
    name = 'spidy'
    start_urls = [
        'https://www.classcentral.com/subject/cs'
    ]

    def parse(self, response):
        items = ClasscentralItem()
        all_tr = response.css('.xlarge-up-width-9-16')
        courses = response.css('.number-of-courses .text--bold::text')
        for x in all_tr:
            sub = response.css('.medium-up-head-1::text').extract()
            course_name = x.css('.course-name .text--bold::text').extract()
            course_devloper = x.css('.uni-name::text').extract()
            via = x.css('.hover-initiativelinks , #course-listing-tbody .text--italic::text').extract()
            duration = x.css('.icon-clock-charcoal::text').extract()
            start_date = x.css('#course-listing-tbody .medium-only-hidden::text').extract()
            rating = x.css('.review-rating::text').extract()
            items['subjectname'] = sub
            items['course_name'] = course_name
            items['course_devloper'] = course_devloper
            items['via'] = via
            items['duration'] = duration
            items['start_date'] = start_date
            items['rating'] = rating
            yield items

Answer 1

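Actually, your all_tr is just a list of all the Course Name cells (one column), not a list of all the table rows. That is why you can't get start_date from x: it lives in a different column, so it is not inside the element that x points to.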

def parse(self, response):
    # Select whole table rows so that every column is reachable from x.
    all_tr = response.css('#course-listing-tbody tr')
    courses = response.css('.number-of-courses .text--bold::text')
    for x in all_tr:
        # Build a fresh dict per row so each yielded item is independent.
        items = {}
        sub = response.css('.medium-up-head-1::text').extract()
        course_name = x.css('.course-name .text--bold::text').get()
        course_devloper = x.css('.uni-name::text').extract()
        via = x.css('.text--italic::text').get()
        duration = x.css('.icon-clock-charcoal::text').extract()
        start_date = x.css('.medium-only-hidden::text').get()
        rating = x.css('td:nth-child(4)').attrib['data-timestamp']
        items['subjectname'] = sub
        items['course_name'] = course_name
        items['course_devloper'] = course_devloper
        items['via'] = via
        items['duration'] = duration
        items['start_date'] = start_date
        items['rating'] = rating
        yield items
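
To see why looping over cells loses the other columns, here is a minimal sketch that runs without a crawl. It uses Python's standard-library xml.etree.ElementTree instead of Scrapy, and the table markup and the start-date class name are made up for illustration (they are not the real Class Central page). The point carries over: a lookup made on x only searches inside x.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified table (an assumption, not the real page markup).
HTML = """
<table>
  <tbody id="course-listing-tbody">
    <tr>
      <td class="course-name">Intro to Python</td>
      <td class="start-date">1st Jan</td>
    </tr>
    <tr>
      <td class="course-name">Algorithms</td>
      <td class="start-date">Self paced</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(HTML)


def text_of(parent, css_class):
    """Return the text of the first descendant <td> with this class, or None."""
    node = parent.find(f".//td[@class='{css_class}']")
    return node.text if node is not None else None


# Looping over the name cells: the date cell is a sibling, not a descendant,
# so searching inside each name cell finds nothing.
name_cells = root.findall(".//td[@class='course-name']")
from_cells = [text_of(cell, "start-date") for cell in name_cells]

# Looping over whole rows: both cells are descendants of the row.
rows = root.findall(".//tbody/tr")
from_rows = [(text_of(row, "course-name"), text_of(row, "start-date")) for row in rows]

print(from_cells)
print(from_rows)
```

The first list contains only None values, which corresponds to the empty fields in the question; the second list pairs each course name with its start date.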

Update: for rating, I grab the data-timestamp attribute of the fourth column (Rating).

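If you look at the page source, you'll see that some rows contain no course details (they are ad rows). That is why you get an error after 5 results. To get all the courses, you need to modify the all_tr selector: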

all_tr = response.css('#course-listing-tbody tr[itemscope]')
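
The effect of the [itemscope] filter can be sketched the same way. This again uses the standard-library xml.etree.ElementTree on made-up markup (the ad row and the timestamp values are assumptions for illustration). Rows without course data have no fourth cell, so reading its attribute fails unless those rows are filtered out or the lookup is guarded. In Scrapy itself, x.css('td:nth-child(4)').attrib.get('data-timestamp') would be the guarded form, since .attrib is a plain dict.

```python
import xml.etree.ElementTree as ET

# Made-up table: the middle row mimics an ad row with no itemscope attribute
# and no data-timestamp cell (illustrative markup only).
HTML = """
<table>
  <tbody id="course-listing-tbody">
    <tr itemscope="">
      <td>A</td><td>B</td><td>C</td><td data-timestamp="111">4.5</td>
    </tr>
    <tr class="ad">
      <td>Sponsored</td>
    </tr>
    <tr itemscope="">
      <td>D</td><td>E</td><td>F</td><td data-timestamp="222">4.0</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(HTML)


def fourth_cell_timestamp(row):
    """Return data-timestamp of the row's fourth <td>, or None if it is missing."""
    cell = row.find("td[4]")
    return cell.get("data-timestamp") if cell is not None else None


# Every row, including the ad row.
all_rows = root.findall(".//tbody/tr")
# Only rows that carry the itemscope attribute, i.e. real course rows.
course_rows = root.findall(".//tbody/tr[@itemscope]")

unfiltered = [fourth_cell_timestamp(r) for r in all_rows]
filtered = [fourth_cell_timestamp(r) for r in course_rows]

print(unfiltered)
print(filtered)
```

Filtering the rows and guarding the lookup are complementary: the filter skips rows that are known to be ads, and the guard keeps the spider from crashing if some other row happens to lack the attribute.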

Unable to scrape a few fields using CSS selectors and SelectorGadget
