Scrapy is not returning any scraped items

Time: 2018-02-16 03:45:23

Tags: python scrapy

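I have just started using Scrapy for web scraping. I have read several documents on pointing it at HTML pages to scrape. I tried it on the entertainmentearth.com website, where for now I am trying to scrape only the image titles, and later the prices and images. As written, my spider returns nothing. Can anyone point out what I am doing wrong?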

Here is the code.

# -*- coding: utf-8 -*-
import scrapy


class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['www.entertainmentearth.com/exclusives.asp']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp/']


    def parse(self, response):
        #Extracting the content using css selectors
        titles = response.css('.title::text').extract()


        #Give the extracted content row wise
        for item in zip(titles):
            #create a dictionary to store the scraped info
            scraped_info = {
                'title' : item[0],

            }

            #yield or give the scraped info to scrapy
            yield scraped_info
        pass

Here is the page in the browser's element inspector: [screenshot]

1 answer:

Answer 0: (score: 1)

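Your spider has several problems:

  • The allowed_domains list should contain only domain names, not exact URLs (see the Scrapy documentation).
  • The URL in start_urls ends with a trailing / (it should read http://www.entertainmentearth.com/exclusives.asp).
  • I am not sure what you are trying to do with zip here, but I am almost certain it does not do what you intended.
  • The pass at the end of the parse method is redundant.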
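The first three points can be checked outside Scrapy with plain Python. The sketch below is only an illustration: the URL is the one from the question, while the sample titles list is made up for the demonstration.

```python
from urllib.parse import urlparse

# URL copied from the question's start_urls, including the trailing slash.
start_url = 'http://www.entertainmentearth.com/exclusives.asp/'

# allowed_domains expects a bare host name, not a URL with a path.
host = urlparse(start_url).hostname
print(host)

# Stripping the trailing slash gives the URL recommended above.
fixed_url = start_url.rstrip('/')
print(fixed_url)

# zip() over a single list just wraps each element in a 1-tuple,
# so item[0] is the element itself and the zip adds nothing.
titles = ['Title A', 'Title B']  # hypothetical sample data
wrapped = list(zip(titles))
print(wrapped)
```

Listing the parent domain (entertainmentearth.com) instead of the full host name also works and additionally permits subdomains such as www.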

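Based on the screenshot you provided, I assume you are trying to scrape the image titles from the page. With that in mind, and taking the notes above into account, here is a working spider: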

# -*- coding: utf-8 -*-
import scrapy

class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['entertainmentearth.com']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp']

    def parse(self, response):
        titles = response.css('img::attr(title)').extract()
        for title in titles:
            scraped_info = {
                'title' : title,
            }
            yield scraped_info
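
The key change is the selector: the titles live in the title attribute of img tags, so a selector that looks for the text of elements with class title can come back empty. As a rough illustration of what img::attr(title) picks out, here is a stdlib-only imitation using html.parser instead of Scrapy's selector engine. The markup and the ImgTitleCollector class are hypothetical, since the real page structure is known only from the screenshot.

```python
from html.parser import HTMLParser

# Hypothetical markup, made up for this illustration.
SAMPLE_HTML = '''
<div class="product">
  <a href="/item1"><img src="a.jpg" title="Figure A"></a>
  <a href="/item2"><img src="b.jpg" title="Figure B"></a>
  <img src="logo.png">
</div>
'''


class ImgTitleCollector(HTMLParser):
    """Collects the title attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == 'img':
            title = dict(attrs).get('title')
            if title is not None:
                self.titles.append(title)


collector = ImgTitleCollector()
collector.feed(SAMPLE_HTML)
print(collector.titles)
```

Images without a title attribute are skipped, which is also how the attribute selector in the spider behaves.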