Reading an image URL with BeautifulSoup

Time: 2018-04-22 00:11:06

Tags: python-2.7 beautifulsoup

I am trying to read an image from a website. This is my code so far:

from bs4 import BeautifulSoup
import requests

url = 'https://www.basketball-reference.com/players/h/hardeja01.html'
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,"lxml")
img_src = soup.find("div", {"class": "media-item"})
print img_src
# <div class="media-item"><img alt="Photo of James Harden" itemscope="image" src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>\n</div>

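I am interested in the URL of the jpg image. I could write a regular expression to get it, but there must be a simpler way.

What is the best way to extract the jpg URL?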

3 answers:

Answer 0 (score: 1)

You can use the select methods, which work with CSS selectors (select_one returns the first match):

img_src = soup.select_one('.media-item > img')['src']
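As a rough sketch of the same selector on a fixed snippet: the HTML string below is copied from the output shown in the question, and the helper name `first_image_src` is hypothetical. Since `select_one` returns `None` when nothing matches, checking before reading the attribute avoids a `TypeError` on pages that lack the element.

```python
from bs4 import BeautifulSoup

# Markup copied from the output printed in the question.
HTML = (
    '<div class="media-item">'
    '<img alt="Photo of James Harden" itemscope="image" '
    'src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>'
    '</div>'
)


def first_image_src(html, selector='.media-item > img'):
    """Return the src of the first element matching selector, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select_one(selector)
    # .get() returns None instead of raising if the attribute is absent.
    return tag.get('src') if tag is not None else None


print(first_image_src(HTML))
```

The built-in 'html.parser' is used here only so the sketch has no dependency on lxml; the selector itself behaves the same with either parser.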

You can also try Requests-HTML:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://www.basketball-reference.com/players/h/hardeja01.html')
>>> r.html.find('.media-item > img', first=True).attrs['src']
'https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg'

Answer 1 (score: 1)

You can achieve this in several ways. Here is one of them:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.basketball-reference.com/players/h/hardeja01.html")
soup = BeautifulSoup(page.text, 'html.parser')
image = soup.find(itemscope="image")['src']
print(image)

Output:

https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg
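Another possible route, sketched here under the assumption that the page still wraps the image in the `media-item` div shown in the question, is to start from that div and look up the nested `img` tag. The HTML string is copied from the question's output and stands in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Stand-in for page.text, copied from the output in the question.
html = (
    '<div class="media-item">'
    '<img alt="Photo of James Harden" itemscope="image" '
    'src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')

# Same lookup as in the question: the wrapping div.
div = soup.find('div', {'class': 'media-item'})

# Step down to the img tag, guarding against a missing div or img.
img = div.find('img') if div is not None else None
src = img.get('src') if img is not None else None

print(src)
```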

Answer 2 (score: 0)

There is a very simple solution:

ls = []
rows = 0
for key in data_dict:
    for tempValue in data_dict[key]:
        # print(tempValue)
        ls.append([])
        for (k, v) in tempValue.items():
            ls[rows].append(key)
            ls[rows].append(k)
            ls[rows].append(v)
        rows+=1
print(ls)