I ran the following Python code in a Jupyter notebook:
from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
movies[0].text_content()
The code above gives the following output:
'The Shawshank Redemption'
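Basically, it is the content of the first row of the column named 'titleColumn' on that web page. The same table has another column named 'posterColumn', which contains a thumbnail image.
Now I want my code to retrieve these images as well, and the output should display each image.
Do I need another package to do this? Can images be displayed in a Jupyter Notebook?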
Answer 0 (score: 0)
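To get the associated images, you need to select the posterColumn cells. From these you can extract the img src entries and download the jpg images. Each file can then be saved under the movie title, taking care to strip any characters that are invalid in file names, such as ':':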
from lxml.html import parse
import requests
import string
valid_chars = "-_.() " + string.ascii_letters + string.digits
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')
for p, m in zip(posters, movies):
    # iterlinks() yields (element, attribute, link, pos) for every link in the cell
    for element, attribute, link, pos in p.iterlinks():
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            # Keep only characters that are safe to use in a file name
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)
            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)
So currently you would see something like:
The Shawshank Redemption https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
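As for showing the images inside the notebook itself, one possible approach is to build a small HTML snippet from the title/URL pairs and hand it to IPython's display machinery. The following is only a sketch under a few assumptions: it targets Python 3, the helper names safe_filename, posters_html and show_posters are made up for illustration, and show_posters only works where IPython is installed (it is in a Jupyter notebook).

```python
import html
import string

VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def safe_filename(title):
    """Drop every character that is not in the whitelist (hypothetical helper)."""
    return "".join(c for c in title if c in VALID_CHARS)


def posters_html(items):
    """Build one <div> per (title, url) pair, escaping both values."""
    rows = []
    for title, url in items:
        rows.append('<div><img src="{}" alt="{}"> {}</div>'.format(
            html.escape(url, quote=True),
            html.escape(title, quote=True),
            html.escape(title),
        ))
    return "\n".join(rows)


def show_posters(items):
    """Render the posters in a notebook cell; requires IPython."""
    # Imported here so the other helpers still work outside a notebook.
    from IPython.display import HTML, display
    display(HTML(posters_html(items)))
```

In a notebook cell you could collect (m.text_content(), link) pairs in the loop above and then call show_posters(pairs). For a single file that was already saved to disk, IPython.display.Image(filename='{}.jpg'.format(safe_filename(title))) should display it as well.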