我正在获取空白的csv,尽管它没有在代码中显示任何错误。 它无法通过网页进行爬网。
这是我写的有关youtube的代码:-
import scrapy
from Example.items import MovieItem
class ThirdSpider(scrapy.Spider):
name = "imdbtestspider"
allowed_domains = ["imdb.com"]
start_url = ('http://www.imdb.com/chart/top',)
def parse(self,response):
links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
i = 1
for link in links:
abs_url = response.urljoin(link)
#
url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()'
rating = response.xpath(url_next).extact()
if (i <= len(link)):
i=i+1
yield scrapy.Request(abs_url, callback = self.parse_indetail, meta = {'rating': rating})
def parse_indetail(self,response):
item = MovieItem()
#
item['title'] = response.xpath('//div[@class="title_wrapper"])/h1/text()').extract[0][:-1]
item['directors'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="director"]/a/span/text()').extract()[0]
item['writers'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="creator"]/a/span/text()').extract()
item['stars'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="actors"]/a/span/text()').extract()
item['popularity'] = response.xpath('//div[@class="titleReviewBarSubItem"]/div/span/text()').extract()[2][21:-8]
return item
这是我在执行代码时得到的输出
scrapy crawl imdbtestspider -o example.csv -t csv
2019-01-17 18:44:34 [scrapy.core.engine]信息:蜘蛛开了 2019-01-17 18:44:34 [scrapy.extensions.logstats]信息:检索到0页 (以0 pag es / min的速度),刮擦0件(以0 items / min的速度)
答案 0 :(得分:1)
这是您可以尝试的另一种方法。我使用css选择器而不是xpath使脚本不太冗长。
import scrapy
class ImbdsdpyderSpider(scrapy.Spider):
name = 'imbdspider'
start_urls = ['http://www.imdb.com/chart/top']
def parse(self, response):
for link in response.css(".titleColumn a[href^='/title/']::attr(href)").extract():
yield scrapy.Request(response.urljoin(link),callback=self.get_info)
def get_info(self, response):
item = {}
title = response.css(".title_wrapper h1::text").extract_first()
item['title'] = ' '.join(title.split()) if title else None
item['directors'] = response.css(".credit_summary_item h4:contains('Director') ~ a::text").extract()
item['writers'] = response.css(".credit_summary_item h4:contains('Writer') ~ a::text").extract()
item['stars'] = response.css(".credit_summary_item h4:contains('Stars') ~ a::text").extract()
popularity = response.css(".titleReviewBarSubItem:contains('Popularity') .subText::text").extract_first()
item['popularity'] = ' '.join(popularity.split()).strip("(") if popularity else None
item['rating'] = response.css(".ratingValue span::text").extract_first()
yield item
答案 1 :(得分:0)
我已经根据xpaths
给您测试过,但我不知道它们是错误的还是实际上是错误的。
例如;
xpath = //*="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()
#There is not table when you reach at div[2]
//div[@class="title_wrapper"])/h1/text() #here there is and error after `]` ) is bad syntax
另外,您的xpath不会产生任何结果。
答案 2 :(得分:0)
关于为什么尽管没有重新创建案例,却出现0 / pages爬行的错误的原因,我必须假设您的页面迭代方法未正确构建页面URL。
我无法理解如何创建所有“跟随链接”的变量数组,然后使用len将它们发送到parse_indetail(),但要注意一些问题。
应该是这样的...
def parse(self,response):
# If you are going to capture an item at the first request, you must instantiate
# your items class
item = MovieItem()
....
# You seem to want to pass ratings to the next function for itimization, so
# you make sure that you have it listed in your items.py file and you set it
item[rating] = response.xpath(PATH).extact() # Why did you ad the url_next? huh?
....
# Standard convention for passing meta using call back is like this, this way
# allows you to pass multiple itemized item gets passed
yield scrapy.Request(abs_url, callback = self.parse_indetail, meta = {'item': item})
def parse_indetail(self,response):
# Then you must initialize the meta again in the function your passing it to
item = response.meta['item']
# Then you can continue your scraping
#items.py file
import scrapy
class TestimbdItem(scrapy.Item):
title = scrapy.Field()
directors = scrapy.Field()
writers = scrapy.Field()
stars = scrapy.Field()
popularity = scrapy.Field()
rating = scrapy.Field()
# The spider file
import scrapy
from testimbd.items import TestimbdItem
class ImbdsdpyderSpider(scrapy.Spider):
name = 'imbdsdpyder'
allowed_domains = ['imdb.com']
start_urls = ['http://www.imdb.com/chart/top']
def parse(self, response):
for href in response.css("td.titleColumn a::attr(href)").extract():
yield scrapy.Request(response.urljoin(href),
callback=self.parse_movie)
def parse_movie(self, response):
item = TestimbdItem()
item['title'] = [ x.replace('\xa0', '') for x in response.css(".title_wrapper h1::text").extract()][0]
item['directors'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Director")]/following-sibling::a/text()').extract()
item['writers'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Writers")]/following-sibling::a/text()').extract()
item['stars'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Stars")]/following-sibling::a/text()').extract()
item['popularity'] = response.css(".titleReviewBarSubItem span.subText::text")[2].re('([0-9]+)')
item['rating'] = response.css(".ratingValue span::text").extract_first()
yield item
注意两件事: 标识parse()函数。我在这里所做的就是通过链接使用一个for循环,循环中的每个实例都引用href,并将urljoined href传递给解析器函数。给出您的用例,这绰绰有余。在具有下一页的情况下,它只是以某种方式为“下一个”页面创建一个变量并进行回调以进行解析,它将一直这样做,直到无法找到“下一个”页面为止。
第二,仅当HTML项中具有相同标签但内容不同的标签时,才使用xpath。这更多是个人观点,但我告诉人们xpath选择器就像解剖刀,而css选择器就像一把割肉刀。您可以使用手术刀来获得准确的结果,但需要花费更多的时间,并且在许多情况下,使用CSS选择器可能更容易获得相同的结果。