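I have just started using Scrapy for web scraping. I have read several documents on pointing a spider at HTML pages to scrape them. I tried it on the Entertainment Earth website, attempting to scrape only the image titles for now, with prices and images to follow later. As written, I get nothing back. Can anyone point out what I am doing wrong?
Here is the code.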
# -*- coding: utf-8 -*-
import scrapy


class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['www.entertainmentearth.com/exclusives.asp']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp/']

    def parse(self, response):
        # Extracting the content using css selectors
        titles = response.css('.title::text').extract()
        # Give the extracted content row wise
        for item in zip(titles):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
            }
            # yield or give the scraped info to scrapy
            yield scraped_info
        pass
Answer 0 (score: 1)
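Your spider has several problems:
- The allowed_domains list should contain only domain names, not exact URLs (see the documentation).
- The URL in start_urls ends with a trailing / (it should read http://www.entertainmentearth.com/exclusives.asp).
- I am not sure what zip is meant to do here, but I am almost certain it is not doing what you intended.
- The pass at the end of the parse method is redundant.
Based on the screenshot provided, you are trying to scrape the image titles from the page. For that, and taking the points above into account, see the working spider code: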
# -*- coding: utf-8 -*-
import scrapy


class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['entertainmentearth.com']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp']

    def parse(self, response):
        # Read the title attribute of every <img> element on the page
        titles = response.css('img::attr(title)').extract()
        for title in titles:
            scraped_info = {
                'title': title,
            }
            yield scraped_info
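
Two of the points above can be checked in plain Python without running the spider. The sketch below is only an illustration: is_allowed is a hypothetical helper that roughly approximates a host-based domain check (it is not Scrapy's actual implementation), and the titles list is made-up sample data.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical, simplified domain check: compare the URL's host
    against each allowed entry (exact match or subdomain match)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "http://www.entertainmentearth.com/exclusives.asp"

# An entry that includes a path can never match a host name.
bad = is_allowed(url, ["www.entertainmentearth.com/exclusives.asp"])
# A bare domain matches the host and any of its subdomains.
good = is_allowed(url, ["entertainmentearth.com"])

# zip() over a single iterable wraps each element in a 1-tuple,
# which is why the original loop had to use item[0].
titles = ["Title A", "Title B"]  # made-up sample data
zipped = list(zip(titles))

print(bad, good, zipped)
```

Iterating over titles directly, as in the spider above, yields the same values without the extra tuple layer.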