Question

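I've just started using Scrapy for web scraping. I've read several documents on pointing a spider at HTML pages to scrape. I tried it on the entertainmentearth.com website, where for now I'm trying to scrape only the image titles; later I want the prices and images as well. When I run the spider, I don't get anything back. Can anyone point out what I'm doing wrong?

Here is the code.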

# -*- coding: utf-8 -*-
import scrapy


class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['www.entertainmentearth.com/exclusives.asp']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp/']


    def parse(self, response):
        #Extracting the content using css selectors
        titles = response.css('.title::text').extract()


        #Give the extracted content row wise
        for item in zip(titles):
            #create a dictionary to store the scraped info
            scraped_info = {
                'title' : item[0],

            }

            #yield or give the scraped info to scrapy
            yield scraped_info
        pass

Here is the inspected element from the web page:

Answer 1

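Your spider has several problems:

allowed_domains should contain only domain names, not full URLs (see the documentation).
The URL in start_urls ends with a trailing / (it should read http://www.entertainmentearth.com/exclusives.asp).
I'm not sure what you are trying to do with zip here, but I'm almost certain it doesn't do what you intended.
The pass at the end of parse is redundant.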
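The first and third points can be checked in plain Python, without running Scrapy. The following is a minimal sketch: the `titles` list is made-up sample data, and `urlparse` is used only to show which part of the URL counts as the domain name.

```python
from urllib.parse import urlparse

# URL from the question, without the trailing slash.
start_url = 'http://www.entertainmentearth.com/exclusives.asp'

# allowed_domains expects a bare domain name, i.e. the host part of the URL,
# with no scheme and no path.
host = urlparse(start_url).hostname
print(host)  # www.entertainmentearth.com

# zip() called with a single iterable wraps each element in a 1-tuple,
# which is why the original loop had to read item[0].
titles = ['Title A', 'Title B']  # hypothetical sample data
wrapped = list(zip(titles))
print(wrapped)  # [('Title A',), ('Title B',)]

# Iterating over the list directly yields the same values with no unpacking.
direct = [title for title in titles]
print(direct)  # ['Title A', 'Title B']
```

Listing the shorter form `entertainmentearth.com` in allowed_domains also covers the `www` subdomain.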

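Judging from the screenshot you provided, you are trying to scrape the image titles from the page. With that in mind, and taking the notes above into account, here is a working version of the spider: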

# -*- coding: utf-8 -*-
import scrapy

class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['entertainmentearth.com']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp']

    def parse(self, response):
        titles = response.css('img::attr(title)').extract()
        for title in titles:
            scraped_info = {
                'title' : title,
            }
            yield scraped_info
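To see what selecting the title attribute of img tags amounts to, the idea can be reproduced offline with the standard library alone. This is only an illustration, not how Scrapy works internally: the `SAMPLE_HTML` markup is hypothetical (the real page may differ), and `ImgTitleCollector` is a helper name invented for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical markup, assumed to resemble the product tiles on the page.
SAMPLE_HTML = """
<div class="product">
  <a href="/item-1"><img src="a.jpg" title="Sample Figure A"></a>
  <a href="/item-2"><img src="b.jpg" title="Sample Figure B"></a>
  <a href="/item-3"><img src="c.jpg"></a>
</div>
"""


class ImgTitleCollector(HTMLParser):
    """Collects the title attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == 'img':
            title = dict(attrs).get('title')
            if title is not None:
                self.titles.append(title)


collector = ImgTitleCollector()
collector.feed(SAMPLE_HTML)
print(collector.titles)  # ['Sample Figure A', 'Sample Figure B']
```

The titles here live in an attribute of the img tag rather than in the text of an element, so on markup like this a text selector such as `.title::text` would have nothing to match.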

Scrapy is not returning any scraped items
