I've created a script using scrapy to grab all the links related to different actors' names from imdb.com, then parse the first three of their movie links, and finally scrape the names of the director and writer of those movies. The script does this flawlessly if I stick with my current attempt. However, I've used the requests module (which I don't wish to use) within the parse_results method to get the customized output.

What the script does (considering the first named link):

This is what I've written so far (working):
import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/list/ls058011111/']

    def parse(self, response):
        soup = BeautifulSoup(response.text, "lxml")
        for name_links in soup.select(".mode-detail")[:10]:
            name = name_links.select_one("h3 > a").get_text(strip=True)
            item_link = response.urljoin(name_links.select_one("h3 > a").get("href"))
            yield scrapy.Request(item_link, meta={"name": name}, callback=self.parse_items)

    def parse_items(self, response):
        name = response.meta.get("name")
        soup = BeautifulSoup(response.text, "lxml")
        item_links = [response.urljoin(item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
        result_list = [i for url in item_links for i in self.parse_results(url)]
        yield {"actor name": name, "associated name list": result_list}

    def parse_results(self, link):
        # Blocking call made with requests, outside Scrapy's scheduler
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "lxml")
        try:
            director = soup.select_one("h4:contains('Director') ~ a").get_text(strip=True)
        except Exception:
            director = ""
        try:
            writer = soup.select_one("h4:contains('Writer') ~ a").get_text(strip=True)
        except Exception:
            writer = ""
        return director, writer

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(ImdbSpider)
c.start()
Output the above script produces (desired):

{'actor name': 'Robert De Niro', 'associated name list': ['Jonathan Jakubowicz', 'Jonathan Jakubowicz', '', 'Anthony Thorne', 'Martin Scorsese', 'David Grann']}
{'actor name': 'Sidney Poitier', 'associated name list': ['Gregg Champion', 'Richard Leder', 'Gregg Champion', 'Sterling Anderson', 'Lloyd Kramer', 'Theodore Isaac Rubin']}
{'actor name': 'Daniel Day-Lewis', 'associated name list': ['Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Steven Spielberg', 'Tony Kushner']}
{'actor name': 'Humphrey Bogart', 'associated name list': ['', '', 'Mark Robson', 'Philip Yordan', 'William Wyler', 'Joseph Hayes']}
{'actor name': 'Gregory Peck', 'associated name list': ['', '', 'Arthur Penn', 'Tina Howe', 'Walter C. Miller', 'Peter Stone']}
{'actor name': 'Denzel Washington', 'associated name list': ['Joel Coen', 'Joel Coen', 'John Lee Hancock', 'John Lee Hancock', 'Antoine Fuqua', 'Richard Wenk']}

In the above approach I've used the requests module within the parse_results method to get the desired output, because yield can't be used within a list comprehension. How can I make the script produce the exact output without using requests?
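For context on why the list-comprehension route fails if parse_results is turned into a generator: calling a generator function only creates a generator object; its body does not run until iterated, and in Scrapy nothing gets scheduled. A minimal standalone illustration (the function and URL names here are made up for the demo):

```python
def scrape(url):
    """Stand-in for a callback that yields results."""
    print("fetching", url)   # does not run until the generator is iterated
    yield "director-of-" + url
    yield "writer-of-" + url

# Calling the function inside a comprehension just builds generator objects:
gens = [scrape(u) for u in ["m1", "m2"]]   # prints nothing
print(gens[0])                             # a generator object, not results

# Flattening forces iteration, but the calls then run sequentially and
# block, which is exactly what requests.get() does in the question:
flat = [name for g in gens for name in g]
print(flat)  # ['director-of-m1', 'writer-of-m1', 'director-of-m2', 'writer-of-m2']
```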
Answer 0 (score: 1):
One way to solve this is to use Request.meta to keep a list of pending URLs for an item across requests, and pop URLs from that list as you go.

As @pguardiario noted, the drawback is that you are still processing only one request from that list at a time. However, if you have more items than your configured concurrency, that shouldn't be a problem.

This approach would look like this:
def parse_items(self, response):
    # … (extract name and item_links as before)
    if item_links:
        meta = {
            "actor name": name,
            "associated name list": [],
            "item_links": item_links,
        }
        yield Request(
            item_links.pop(),
            callback=self.parse_results,
            meta=meta,
        )
    else:
        yield {"actor name": name}

def parse_results(self, response):
    # … (extract director and writer as before)
    response.meta["associated name list"].append((director, writer))
    if response.meta["item_links"]:
        yield Request(
            response.meta["item_links"].pop(),
            callback=self.parse_results,
            meta=response.meta,
        )
    else:
        yield {
            "actor name": response.meta["actor name"],
            "associated name list": response.meta["associated name list"],
        }
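To see how the meta dict threads state through the chain of callbacks, here is a self-contained simulation of that pattern. FakeRequest, FakeResponse, and the URLs ("m1", "m2", "m3") are stand-ins invented for this sketch, not real Scrapy classes; the loop at the bottom plays the role of Scrapy's engine.

```python
class FakeRequest:
    def __init__(self, url, callback, meta):
        self.url, self.callback, self.meta = url, callback, meta

class FakeResponse:
    def __init__(self, request):
        self.url, self.meta = request.url, request.meta

def parse_results(response):
    # Pretend we scraped a (director, writer) pair from this page.
    response.meta["associated name list"].append(
        ("director@" + response.url, "writer@" + response.url))
    if response.meta["item_links"]:
        # More movie pages pending: chain the next request, reusing the meta.
        yield FakeRequest(response.meta["item_links"].pop(),
                          parse_results, response.meta)
    else:
        # Pending list exhausted: emit the accumulated item.
        yield {"actor name": response.meta["actor name"],
               "associated name list": response.meta["associated name list"]}

# Drive the chain roughly the way Scrapy's engine would: one request at a time.
meta = {"actor name": "A", "associated name list": [], "item_links": ["m2", "m3"]}
pending = [FakeRequest("m1", parse_results, meta)]
items = []
while pending:
    request = pending.pop()
    for result in request.callback(FakeResponse(request)):
        if isinstance(result, FakeRequest):
            pending.append(result)
        else:
            items.append(result)

print(items)
```

Note that one item is emitted only after all three simulated pages have been visited, with the (director, writer) pairs accumulated in the shared meta dict.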