<div class="someClass">
<a href="href">
<img alt="some" src="some"/>
</a>
</div>
I am using bs4 and cannot get the src with a.attrs['src'], although I can get the href. What should I do?
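Answer 0 (score: 16)
You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works on a URL when combined with urllib2.
For a URL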
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2

page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']
For text containing img tags
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as BSHTML

htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']
Answer 1 (score: 6)
A link has no src attribute; you have to target the actual img tag.
import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Returns the src attribute of the img tag nested inside the first 'a' tag
soup.a.img['src']
# 'some'

# If you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
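To see why the lookup in the question fails, here is a minimal sketch that reuses the question's HTML. The attribute dictionary of the a tag holds only the link's own attributes, so src has to be read from the nested img. The variable names (link_attrs, missing_src, img_src) are illustrative only.

```python
import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")
a = soup.find("a")

# The link's attribute dictionary contains only its own attributes,
# which is why a.attrs['src'] raises KeyError while a.attrs['href'] works.
link_attrs = dict(a.attrs)

# .get() returns None instead of raising KeyError for a missing attribute.
missing_src = a.get("src")

# The src attribute lives on the img tag nested inside the link.
img_src = a.find("img")["src"]

print(link_attrs, missing_src, img_src)
```

Using .get() on a tag is a convenient way to probe for an attribute without wrapping the lookup in a try/except.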
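Answer 2 (score: 1)
You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works on a URL when combined with urllib3.
The solution in the top-voted answer does not work on Python 3. Here is a version that does:
For a URL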
from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'
response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')
for image in images:
    # print image source
    print(image['src'])
    # print alternate text
    print(image['alt'])
For text containing img tags
from bs4 import BeautifulSoup as BSHTML

htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# Name the parser explicitly so bs4 does not have to guess one
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
Answer 3 (score: 1)
Here is a solution that does not raise a KeyError when an img tag has no src attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for img in images:
    # Skip img tags that have no src attribute
    if img.has_attr('src'):
        print(img['src'])
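As an alternative sketch of the same idea, a CSS attribute selector can filter out img tags without src up front, and Tag.get can supply a default for optional attributes such as alt. The sample markup and the helper name image_sources are made up for illustration and are not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second img has no src attribute.
sample_html = """
<img src="https://src1.com/" alt="first">
<img alt="no source here">
<img src="https://src2.com/">
"""

def image_sources(markup):
    """Return (src, alt) pairs for every img tag that has a src attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    # 'img[src]' matches only img tags that carry a src attribute,
    # so indexing img["src"] below cannot raise KeyError.
    return [(img["src"], img.get("alt", "")) for img in soup.select("img[src]")]

pairs = image_sources(sample_html)
for src, alt in pairs:
    print(src, alt)
```

Whether has_attr or a selector reads better is largely a matter of taste; the selector keeps the filtering in one place when several attributes are required.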