I am doing some web scraping on the Rotten Tomatoes site, for the example here.
I am using Python with the Beautiful Soup and lxml modules.
I want to extract the movie information, for example:
- Genre: Drama, Musical & Performing Arts
- Director: Kirill Serebrennikov
- Writers: Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko
- Writers (links): /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko
I inspected the HTML of the page to work out the paths:
<li class="meta-row clearfix">
<div class="meta-label subtle">Rating: </div>
<div class="meta-value">NR</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Genre: </div>
<div class="meta-value">
<a href="/browse/opening/?genres=9">Drama</a>,
<a href="/browse/opening/?genres=12">Musical & Performing Arts</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Directed By: </div>
<div class="meta-value">
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Written By: </div>
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>,
<a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>,
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>,
<a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">In Theaters: </div>
<div class="meta-value">
<time datetime="2019-06-06T17:00:00-07:00">Jun 7, 2019</time>
<span style="text-transform:capitalize"> limited</span>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Runtime: </div>
<div class="meta-value">
<time datetime="P126M">
126 minutes
</time>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Studio: </div>
<div class="meta-value">
<a href="http://sonypictures.ru/leto/" target="movie-studio">Gunpowder & Sky</a>
</div>
</li>
I created the html objects like this:
import requests
from bs4 import BeautifulSoup
from lxml import html
page_response = requests.get(url, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
tree = html.fromstring(page_response.content)
For example, for the writers, since I only need the text of the elements, it is easy to get with:
page_content.select('div.meta-value')[3].getText()
Or, using xpath, for the rating:
tree.xpath('//div[@class="meta-value"]/text()')[0]
For the writer links in question, to reach the HTML block I do:
page_content.select('div.meta-value')[3]
which gives:
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>,
<a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>,
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>,
<a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>
Or:
tree.xpath('//div[@class="meta-value"]')[3]
which gives:
<Element div at 0x2915a4c54a8>
The problem is that I cannot extract the "href" values. My desired output is:
/celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko
I have tried:
page_content.select('div.meta-value')[3].get('href')
tree.xpath('//div[@class="meta-value"]')[3].get('href')
tree.xpath('//div[@class="meta-value"]/@href')[3]
All of them return empty or wrong results. Can someone help me?
Thanks in advance! Cheers!
Answer 0 (score: 0)
Try the following scripts to get the content you are interested in. Make sure to test them with different movies; I suppose both of them will produce the wanted output. I tried to avoid using any hardcoded index to locate the content.
Using CSS selectors:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.rottentomatoes.com/m/leto')
soup = BeautifulSoup(r.text,'lxml')
directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
written = [item.text for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
written_links = [item.get("href") for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
print(directed,written,written_links)
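If you want the single comma-separated string shown in the question rather than a Python list, joining the scraped list should do it (a small follow-up using the written_links variable from above):
# Join the scraped hrefs into one comma-separated string,
# matching the output format asked for in the question.
print(", ".join(written_links))
# expected output, per the question:
# /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko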
Using xpath:
import requests
from lxml.html import fromstring
r = requests.get('https://www.rottentomatoes.com/m/leto')
root = fromstring(r.text)
directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
written_links = root.xpath(".//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a//@href")
print(directed,written,written_links)
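Note that the xpath calls return lists of strings (one entry per matching text node or attribute), so directed here comes back as a list rather than a plain string. Indexing or joining gives the same shape as the CSS version, for example:
# The xpath results are lists; index or join them to get plain strings.
print(directed[0])               # should print 'Kirill Serebrennikov' for this movie
print(", ".join(written_links))  # one comma-separated string of the writer links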
In the case of the cast, I used a list comprehension so that I can use .strip() on the individual elements to get rid of the whitespace. normalize-space() would be the ideal choice, though.
cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
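For completeness, a rough sketch of how normalize-space() could be applied per element with lxml instead of .strip(), assuming the same root object as above:
# Call normalize-space() on each matched <span> element so lxml
# collapses the surrounding whitespace for us.
cast = [item.xpath("normalize-space()")
        for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]")]
print(cast)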