Question

I ran the following Python code in a Jupyter notebook:

from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')

movies[0].text_content()

The code above gave the following output:

'The Shawshank Redemption'

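Basically, it is the content of the first row of the column named 'titleColumn' on that web page. The same table has another column named 'posterColumn', which contains thumbnail images.

Now I would like my code to retrieve those images and show them in the output as well.

Do I need another package to achieve this? Can images be displayed in a Jupyter Notebook?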

Answer 1

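To get the associated images, you need to select the posterColumn cells. From those you can extract the img src entries and download the jpg images. Each file can then be saved under the movie title, taking care to remove any characters that are invalid in a filename, such as ':':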

from lxml.html import parse
import requests
import string

valid_chars = "-_.() " + string.ascii_letters + string.digits
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

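for p, m in zip(posters, movies):
    # iterlinks() yields every link found under the poster cell's <a> element
    for element, attribute, link, pos in p.iterlinks():
        # The thumbnail URL is the img tag's src attribute
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            # Keep only characters that are safe in a filename
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)

            # Stream the image to disk in chunks
            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)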

Running this currently prints something like:

The Shawshank Redemption                           https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather                                      https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II                             https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
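
The code above saves the posters but does not show them. One way to display a saved poster inline in a Jupyter notebook is IPython's display module, which ships with Jupyter, so no extra package should be needed for that part. A minimal sketch, assuming the loop above has already written the .jpg files to the working directory; `safe_filename` and `show_poster` are hypothetical helper names, not part of the original answer:

```python
import string

# Same whitelist of filename characters as in the answer above
VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def safe_filename(title):
    """Drop every character that is not in the whitelist (hypothetical helper)."""
    return ''.join(c for c in title if c in VALID_CHARS)


def show_poster(title):
    """Display the saved poster for a title inline in the notebook (hypothetical helper).

    Assumes '<cleaned title>.jpg' was already written by the download loop.
    """
    # Imported here so safe_filename() can still be used outside a notebook
    from IPython.display import Image, display
    display(Image(filename='{}.jpg'.format(safe_filename(title))))
```

In a notebook cell, calling `show_poster(movies[0].text_content())` should then render the first thumbnail below the cell, provided the corresponding file exists.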

How can I use my Python code to retrieve images located in a table on a website?
