Unable to scrape a few fields using CSS selectors and SelectorGadget

Time: 2019-10-13 07:37:39

Tags: python-3.x web-scraping scrapy

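I have a project to scrape data from classcentral.com. I used SelectorGadget to get the CSS selectors for a few fields, but when I run the program, those fields come back empty. I can't find the mistake in my code, please help! (PS: I'm a beginner and this is my first project, so please make sure your answer is understandable for someone like me.)

Open the following link to see the fields: https://www.classcentral.com/subject/cs

I am unable to scrape the following fields:

  1. Start date: the date the course starts.
  2. Via: the site hosting the course (e.g., Coursera).
  3. Rating: the number of stars and the number of reviews the course has received.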

    import scrapy
    from ..items import ClasscentralItem

    class ClassCentral(scrapy.Spider):
        name = 'spidy'
        start_urls = [
            'https://www.classcentral.com/subject/cs'
        ]

        def parse(self, response):
            items = ClasscentralItem()
            all_tr = response.css('.xlarge-up-width-9-16')
            courses = response.css('.number-of-courses .text--bold::text')
            for x in all_tr:
                sub = response.css('.medium-up-head-1::text').extract()
                course_name = x.css('.course-name .text--bold::text').extract()
                course_devloper = x.css('.uni-name::text').extract()
                via = x.css('.hover-initiativelinks , #course-listing-tbody .text--italic::text').extract()
                duration = x.css('.icon-clock-charcoal::text').extract()
                start_date = x.css('#course-listing-tbody .medium-only-hidden::text').extract()
                rating = x.css('.review-rating::text').extract()
                items['subjectname'] = sub
                items['course_name'] = course_name
                items['course_devloper'] = course_devloper
                items['via'] = via
                items['duration'] = duration
                items['start_date'] = start_date
                items['rating'] = rating
                yield items
    

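1 Answer:

Answer 0: (score: 1)

Actually, your all_tr is just a list of all the Course Name cells, not a list of all the table rows. Every x.css(...) call only searches inside x, so you can't get start_date from x: it lives in a different column of the same row.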
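To make the scoping issue concrete, here is a small offline sketch. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy so it runs without a network connection, and the markup and class names (start-date in particular) are made up for illustration, not copied from the real page. The idea is the same as with x.css(...): a lookup started from a node only sees that node's descendants.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one name cell and one date cell per row.
HTML = """
<table>
  <tbody id="course-listing-tbody">
    <tr>
      <td class="course-name">Intro to Python</td>
      <td class="start-date">2019-10-14</td>
    </tr>
    <tr>
      <td class="course-name">Algorithms</td>
      <td class="start-date">2019-11-04</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(HTML)


def start_dates_within(nodes):
    """For each node, search only its descendants for a start-date cell."""
    found = []
    for node in nodes:
        cell = node.find(".//td[@class='start-date']")
        found.append(cell.text if cell is not None else None)
    return found


# Scope 1: the name cells only (analogous to the question's all_tr).
name_cells = root.findall(".//td[@class='course-name']")
# Scope 2: the whole rows (analogous to '#course-listing-tbody tr').
rows = root.findall(".//tbody/tr")

from_cells = start_dates_within(name_cells)
from_rows = start_dates_within(rows)

print(from_cells)  # nothing found: the date is a sibling cell, not a descendant
print(from_rows)   # found: the date cell is inside each row
```

Starting from the row, every column is a descendant and can be reached; starting from a single cell, the neighbouring cells are out of reach.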

    def parse(self, response):
        # Select whole table rows so that every column is reachable from x.
        all_tr = response.css('#course-listing-tbody tr')
        courses = response.css('.number-of-courses .text--bold::text')
        for x in all_tr:
            items = {}  # fresh dict per row, so yielded items do not share state
            sub = response.css('.medium-up-head-1::text').extract()
            course_name = x.css('.course-name .text--bold::text').get()
            course_devloper = x.css('.uni-name::text').extract()
            via = x.css('.text--italic::text').get()
            duration = x.css('.icon-clock-charcoal::text').extract()
            start_date = x.css('.medium-only-hidden::text').get()
            # Read the data-timestamp attribute of the fourth column.
            rating = x.css('td:nth-child(4)').attrib['data-timestamp']
            items['subjectname'] = sub
            items['course_name'] = course_name
            items['course_devloper'] = course_devloper
            items['via'] = via
            items['duration'] = duration
            items['start_date'] = start_date
            items['rating'] = rating
            yield items

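Update: for rating, I grab the data-timestamp attribute of the fourth column (the rating column).

If you look at the page source, you'll find that some rows have no course details (they are ad rows). That's why you get an error after 5 results. To get all the courses, you need to modify the all_tr selector: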

    all_tr = response.css('#course-listing-tbody tr[itemscope]')
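As a rough illustration of what the [itemscope] filter buys you, here is another offline sketch using xml.etree.ElementTree rather than Scrapy. The rows, the ad row, and the timestamp values are all invented for the example; the real page may differ. Because ElementTree parses XML, the boolean HTML attribute is written as itemscope="" here, and the XPath predicate tr[@itemscope] plays the role of the CSS selector tr[itemscope].

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two course rows and one ad row without course cells.
HTML = """
<tbody id="course-listing-tbody">
  <tr itemscope="">
    <td>Intro to Python</td><td>Coursera</td><td>6 weeks</td>
    <td data-timestamp="1571011200">14th Oct</td>
  </tr>
  <tr>
    <td colspan="4">Sponsored</td>
  </tr>
  <tr itemscope="">
    <td>Algorithms</td><td>edX</td><td>8 weeks</td>
    <td data-timestamp="1572825600">4th Nov</td>
  </tr>
</tbody>
"""

tbody = ET.fromstring(HTML)


def timestamp_of(row):
    """Return the fourth cell's data-timestamp, or None if the row lacks it."""
    cell = row.find("td[4]")  # XPath positions are 1-based
    return None if cell is None else cell.get("data-timestamp")


all_rows = tbody.findall("tr")                # includes the ad row
course_rows = tbody.findall("tr[@itemscope]")  # only rows carrying itemscope

all_values = [timestamp_of(r) for r in all_rows]
course_values = [timestamp_of(r) for r in course_rows]

print(all_values)     # the ad row contributes a missing value
print(course_values)  # every remaining row has the attribute
```

With the unfiltered rows, the ad row has no fourth cell, which is the kind of row that makes a direct attrib['data-timestamp'] lookup fail. Filtering on itemscope removes those rows before any field is read.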