Wrong XPath in an IMDB Scrapy spider

Time: 2017-07-14 15:31:25

Tags: python xpath scrapy

Following: IMDB scrapy get all movie data

response.xpath("//*[@class='results']/tr/td[3]")

returns an empty list. I tried changing it to:

response.xpath("//*[contains(@class,'chart full-width')]/tbody/tr")

without success.

Can anyone help? Thanks.

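2 answers:

Answer 0: (score: 0)

I did not have time to go through IMDB scrapy get all movie data thoroughly, but I have got the gist of it. The problem statement is to get all the movie data from the given site. It involves two things. The first is to go through all the pages that contain the list of all the movies of a given year. The second is to get the link to each movie, and from there you do your own magic.

The problem you are facing is with the XPath for getting the link to each movie. This is most likely due to a change in the site's structure (I did not have time to verify what the differences might be). Anyhow, the following are the XPath expressions you need.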

First:

We take the div with class nav as a landmark and find the element with class lister-page-next next-page among its children.

response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()

This gives the link to the next page, or None on the last page (since the next-page tag does not exist there).

Second:

This is the OP's original doubt.

# Get the list of containers holding the title, etc.
containers = response.xpath("//div[@class='lister-item-content']")

# From each container extract the required links
paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()

Now all you need to do is loop through each element of paths and request the page.
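The two expressions can be tried offline before running the spider. The sketch below is only an illustration under stated assumptions: the HTML is a hypothetical, heavily simplified stand-in for a results page (not the real IMDB markup), and it uses the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree supports only a subset of XPath, so the href is read with .get() instead of /@href. The helper names next_page_href and movie_paths are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that only mirrors the class names used in the answer.
SAMPLE = """
<html><body>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="/title/tt0000001/">Movie A</a></h3>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="/title/tt0000002/">Movie B</a></h3>
  </div>
  <div class="nav"><div>
    <a class="lister-page-next next-page" href="/search/title?page=2">Next</a>
  </div></div>
</body></html>
""".strip()


def next_page_href(root):
    """Return the next-page link, or None when the link is absent (last page)."""
    link = root.find(
        ".//div[@class='nav']/div/a[@class='lister-page-next next-page']"
    )
    return link.get("href") if link is not None else None


def movie_paths(root):
    """Collect the relative link of every movie listed on the page."""
    paths = []
    for content in root.findall(".//div[@class='lister-item-content']"):
        for anchor in content.findall("h3[@class='lister-item-header']/a"):
            paths.append(anchor.get("href"))
    return paths


root = ET.fromstring(SAMPLE)
print(next_page_href(root))
print(movie_paths(root))
```

In a spider, the same expressions would go into response.xpath(...) as written in the answer. Note that @class='...' is an exact string match, so it stops matching if the site adds or reorders class names.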

Answer 1: (score: 0)

Thanks for your answer. I ended up using your XPath like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # Follow the "next page" link on every results page and parse it with parse_page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
                  callback="parse_page", follow=True),)

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract() # list of paths of movies in the current page

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass # lots of parsing....

Run it with:

scrapy crawl imdb -a start=<start-year> -a end=<end-year>
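To see which URLs the spider would request for a given start and end, the URL-building steps can be reproduced without Scrapy. This is a minimal sketch, not part of the spider: search_urls and movie_url are hypothetical helper names, the movie path is a made-up example, and urljoin from the standard library is swapped in for the plain string concatenation used in parse_page.

```python
from urllib.parse import urljoin

IMDB_URL = "http://imdb.com"
# Same search URL template as in start_requests above.
SEARCH_URL = (
    "http://www.imdb.com/search/title?year={year},{year}"
    "&title_type=feature&sort=num_votes,desc"
)


def search_urls(start_year, end_year):
    """Build one search URL per year, with both bounds inclusive."""
    return [SEARCH_URL.format(year=year)
            for year in range(start_year, end_year + 1)]


def movie_url(path):
    """Turn a relative movie path taken from the listing into an absolute URL."""
    return urljoin(IMDB_URL, path)


for url in search_urls(2015, 2016):
    print(url)
print(movie_url("/title/tt0000001/"))
```

With the spider's defaults (1874 to 2017), search_urls would produce 144 start URLs, one per year.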