I am trying to extract the value of the src attribute from an HTML block. The block is:
<img class="product-image first-image" src="https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg">
My code is:
import requests
import json
from bs4 import BeautifulSoup
import re
headers = {'User-agent': 'Mozilla/5.0'}
url = 'https://www.net-a-porter.com/us/en/product/1083507/maje/layered-plaid-twill-and-stretch-cotton-jersey-top'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
if 'net-a-porter' in url:
    # Take the first <img> whose class attribute is exactly "product-image first-image"
    i = soup.find_all('img', class_="product-image first-image")[0]["src"]
    print(i)
The result I get:
//cache.net-a-porter.com/images/products/1083507/1083507_in_xs.jpg
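But I want to get exactly what is in the original HTML:
https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg
My result differs from the original src value: the https: prefix is gone, and 1083507_in_pp has become 1083507_in_xs. I don't know why this happens. Does anyone know how to fix it? Thanks!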
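Answer 0 (score: 0)
You are close, but you need to read the "src" key from the tag's built-in attrs dictionary (tag['src'] is shorthand for tag.attrs['src']):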
if 'net-a-porter' in url:
    # Keep the tag itself, then look up its src attribute
    i = soup.find_all('img', class_="product-image first-image")[0]
    print(i['src'])
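As a minimal sketch of the attribute lookup described above, the snippet below parses the <img> tag quoted in the question as a static string (no network request) and reads src through tag.attrs. The helper names first_image_src and absolute_url are hypothetical, introduced only for illustration. The second helper assumes that a protocol-relative value such as //cache.net-a-porter.com/... should be resolved against the page URL, which urljoin from the standard library does. It does not address why the fetched HTML contains _in_xs rather than _in_pp; nothing in the thread establishes the cause.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The <img> tag quoted in the question, used here as static input.
HTML = ('<img class="product-image first-image" '
        'src="https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg">')

# The page URL from the question, used only as the base for resolving URLs.
PAGE_URL = ('https://www.net-a-porter.com/us/en/product/1083507/maje/'
            'layered-plaid-twill-and-stretch-cotton-jersey-top')


def first_image_src(html):
    """Hypothetical helper: return the src of the first matching <img>, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('img', class_='product-image first-image')
    if tag is None:
        return None
    # tag.attrs is a plain dict of the tag's attributes; tag['src'] is shorthand for it.
    return tag.attrs.get('src')


def absolute_url(base, src):
    """Hypothetical helper: resolve a relative or protocol-relative src against base."""
    return urljoin(base, src)


src = first_image_src(HTML)
print(src)

# A protocol-relative value like the one in the question's output picks up the base scheme.
relative_src = '//cache.net-a-porter.com/images/products/1083507/1083507_in_xs.jpg'
print(absolute_url(PAGE_URL, relative_src))
```

Using find and checking for None avoids the IndexError that find_all(...)[0] raises when the page contains no matching tag.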