Can't parse customized results without using requests within scrapy

Date: 2019-12-01 15:13:42

Tags: python python-3.x web-scraping scrapy

I've created a script using scrapy to scrape all the links connected to the names of different actors from imdb.com, then parse the first three of each actor's movie links, and finally scrape the names of the director and writer of those movies. My script does this flawlessly if I stick with my current attempt. However, I've had to use the requests module (which I don't want to use) within the parse_results method to get the customized output.

Website address: https://www.imdb.com/list/ls058011111/

What the script does (consider the first named link, e.g. Robert De Niro):

  1. The script uses the above URL to scrape the named links, then parses the first three movie links located under the Filmography heading of each actor's page.

  2. It then parses the names of the directors and writers from each of those movie pages.

This is what I've written so far (the working attempt):

import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/list/ls058011111/']

    def parse(self, response):
        # Collect the first ten actor links from the listing page.
        soup = BeautifulSoup(response.text, "lxml")
        for name_links in soup.select(".mode-detail")[:10]:
            name = name_links.select_one("h3 > a").get_text(strip=True)
            item_link = response.urljoin(name_links.select_one("h3 > a").get("href"))
            yield scrapy.Request(item_link, meta={"name": name}, callback=self.parse_items)

    def parse_items(self, response):
        # Grab the first three movie links from the Filmography section.
        name = response.meta.get("name")
        soup = BeautifulSoup(response.text, "lxml")
        item_links = [response.urljoin(item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
        result_list = [i for url in item_links for i in self.parse_results(url)]
        yield {"actor name": name, "associated name list": result_list}

    def parse_results(self, link):
        # Blocking fetch via requests -- this is the part I want to get rid of.
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "lxml")
        try:
            director = soup.select_one("h4:contains('Director') ~ a").get_text(strip=True)
        except Exception:
            director = ""
        try:
            writer = soup.select_one("h4:contains('Writer') ~ a").get_text(strip=True)
        except Exception:
            writer = ""
        return director, writer


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(ImdbSpider)
c.start()
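
The script above is self-contained; judging from its imports, it needs scrapy, requests, beautifulsoup4 and lxml installed, e.g.:

pip install scrapy requests beautifulsoup4 lxml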

Output the above script produces (the desired one):

{'actor name': 'Robert De Niro', 'associated name list': ['Jonathan Jakubowicz', 'Jonathan Jakubowicz', '', 'Anthony Thorne', 'Martin Scorsese', 'David Grann']}
{'actor name': 'Sidney Poitier', 'associated name list': ['Gregg Champion', 'Richard Leder', 'Gregg Champion', 'Sterling Anderson', 'Lloyd Kramer', 'Theodore Isaac Rubin']}
{'actor name': 'Daniel Day-Lewis', 'associated name list': ['Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Steven Spielberg', 'Tony Kushner']}
{'actor name': 'Humphrey Bogart', 'associated name list': ['', '', 'Mark Robson', 'Philip Yordan', 'William Wyler', 'Joseph Hayes']}
{'actor name': 'Gregory Peck', 'associated name list': ['', '', 'Arthur Penn', 'Tina Howe', 'Walter C. Miller', 'Peter Stone']}
{'actor name': 'Denzel Washington', 'associated name list': ['Joel Coen', 'Joel Coen', 'John Lee Hancock', 'John Lee Hancock', 'Antoine Fuqua', 'Richard Wenk']}

In the approach above I used the requests module inside the parse_results method to collect the desired output, since yield cannot be used within a list comprehension.

How can I get my script to produce the exact output without using the requests module?

1 Answer:

Answer 0 (score: 1)

One way to handle this is to keep the list of an item's pending URLs in Request.meta and pop URLs off that list as each response comes in.

As @pguardiario noted, the drawback is that you are still processing only one request from that list at a time. If you have more items than your configured concurrency, though, that should not be a problem.
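
As an aside, that concurrency is governed by Scrapy's standard settings; a minimal sketch of raising it via the same CrawlerProcess the question uses (the values here are illustrative, not recommendations):

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'CONCURRENT_REQUESTS': 16,           # global cap on in-flight requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8, # per-domain cap; imdb.com is the only domain here
})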

This approach would look something like this:

def parse_items(self, response):
    # … name and item_links are extracted exactly as in the original spider
    if item_links:
        meta = {
            "actor name": name,
            "associated name list": [],
            "item_links": item_links,  # pending movie URLs for this actor
        }
        # Request here is scrapy.Request
        yield Request(
            item_links.pop(),
            callback=self.parse_results,
            meta=meta
        )
    else:
        yield {"actor name": name}

def parse_results(self, response):
    # … director and writer are extracted from the movie page
    response.meta["associated name list"].append((director, writer))
    if response.meta["item_links"]:
        # More movie pages pending: chain the next request, carrying meta along.
        yield Request(
            response.meta["item_links"].pop(),
            callback=self.parse_results,
            meta=response.meta
        )
    else:
        # All movie links processed: emit the finished item.
        yield {
            "actor name": response.meta["actor name"],
            "associated name list": response.meta["associated name list"],
        }
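
For completeness, below is a minimal sketch of how those two callbacks could slot into the original spider so that requests and BeautifulSoup drop out entirely. It relies on parsel's non-standard :contains() CSS extension (which Scrapy supports) to keep the question's selectors; pop(0) keeps the movie links in their original order, and dont_filter=True prevents a film shared by two actors from being deduplicated away. Treat it as an illustration of the approach rather than a tested drop-in.

import scrapy
from scrapy.crawler import CrawlerProcess

class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/list/ls058011111/']

    def parse(self, response):
        # Same listing page, using Scrapy's own selectors instead of BeautifulSoup.
        for name_link in response.css(".mode-detail h3 > a")[:10]:
            name = name_link.css("::text").get("").strip()
            yield response.follow(name_link, callback=self.parse_items, meta={"name": name})

    def parse_items(self, response):
        name = response.meta["name"]
        item_links = [
            response.urljoin(href) for href in
            response.css(".filmo-category-section .filmo-row > b > a::attr(href)").getall()[:3]
        ]
        if item_links:
            meta = {"actor name": name, "associated name list": [], "item_links": item_links}
            yield scrapy.Request(item_links.pop(0), callback=self.parse_results,
                                 meta=meta, dont_filter=True)
        else:
            yield {"actor name": name}

    def parse_results(self, response):
        director = response.css("h4:contains('Director') ~ a::text").get("").strip()
        writer = response.css("h4:contains('Writer') ~ a::text").get("").strip()
        # extend() keeps the flat list shape shown in the desired output.
        response.meta["associated name list"].extend([director, writer])
        if response.meta["item_links"]:
            yield scrapy.Request(response.meta["item_links"].pop(0), callback=self.parse_results,
                                 meta=response.meta, dont_filter=True)
        else:
            yield {"actor name": response.meta["actor name"],
                   "associated name list": response.meta["associated name list"]}

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(ImdbSpider)
c.start()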