I am trying to read an image from a website. Here is my code so far:
from bs4 import BeautifulSoup
import requests
url = 'https://www.basketball-reference.com/players/h/hardeja01.html'
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,"lxml")
img_src = soup.find("div", {"class": "media-item"})
print(img_src)
# <div class="media-item"><img alt="Photo of James Harden" itemscope="image" src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>\n</div>
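I am interested in the URL of the jpg image. I could write a regular expression to pull it out, but there must be a simpler way.
What is the best way to extract the jpg URL?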
Answer 0 (score: 1)
You can use CSS selectors, which BeautifulSoup supports through its select method (or select_one for a single match):
img_src = soup.select_one('.media-item > img')['src']
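To see what the selector matches without touching the network, here is a minimal sketch that runs select_one against the div printed in the question. The helper name extract_img_src is hypothetical, and the None check is an added assumption for pages where the element is missing; the original one-liner would raise a TypeError in that case.

```python
from bs4 import BeautifulSoup

# Markup copied from the output printed in the question.
HTML = (
    '<div class="media-item"><img alt="Photo of James Harden" itemscope="image" '
    'src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>\n</div>'
)


def extract_img_src(html):
    """Return the src of the first <img> directly inside .media-item, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one(".media-item > img")
    # select_one returns None when nothing matches; Tag.get returns None
    # when the attribute is absent, so neither case raises.
    return img.get("src") if img is not None else None


print(extract_img_src(HTML))
```

Returning None instead of raising lets the caller decide how to handle a player page that has no photo.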
You can also try Requests-HTML:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.basketball-reference.com/players/h/hardeja01.html')
print(r.html.find('.media-item > img', first=True).attrs['src'])
# https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg
Answer 1 (score: 1)
There are several ways to do this. Here is one of them:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.basketball-reference.com/players/h/hardeja01.html")
soup = BeautifulSoup(page.text, 'html.parser')
image = soup.find(itemscope="image")['src']
print(image)
Output:
https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg
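If installing a third-party parser is not an option, the same attribute-based lookup can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the answer above uses, and the class name ImgSrcCollector is hypothetical. The markup is the snippet printed in the question.

```python
from html.parser import HTMLParser

# Markup copied from the output printed in the question.
HTML = (
    '<div class="media-item"><img alt="Photo of James Harden" itemscope="image" '
    'src="https://d2cwpp38twqe55.cloudfront.net/req/201804182/images/players/hardeja01.jpg"/>\n</div>'
)


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> whose itemscope attribute is "image"."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img .../> are also routed here by the default handle_startendtag.
        if tag != "img":
            return
        attr_map = dict(attrs)
        if attr_map.get("itemscope") == "image" and attr_map.get("src"):
            self.sources.append(attr_map["src"])


collector = ImgSrcCollector()
collector.feed(HTML)
print(collector.sources)
```

For a one-off lookup the BeautifulSoup version is shorter; this variant mainly helps when adding dependencies is the obstacle.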
Answer 2 (score: 0)
There is a very simple solution:
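# Note: data_dict is not defined in this answer; the loops below assume it
# maps each key to a list of dictionaries.
ls = []
rows = 0
for key in data_dict:
    for tempValue in data_dict[key]:
        # print(tempValue)
        ls.append([])  # start a new row for this entry
        for (k, v) in tempValue.items():
            ls[rows].append(key)
            ls[rows].append(k)
            ls[rows].append(v)
        rows += 1
print(ls)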