我试图使用beautifulsoup刮取产品名称,描述,价格和图像。我有以下bs4.element.Tag类型,我想从标签中提取“src”链接。以下是我的标签:
df = <a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>
如何提取
src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"
来自这个标签?我试过了
df.attrs['src']
但我收到了Keyerror。
答案 0 :(得分:1)
src位于 img 标记中:
join
哪个会给你:
from bs4 import BeautifulSoup
tag = """<a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>"""
soup = BeautifulSoup(tag,"lxml")
src = soup.img["src"]
答案 1 :(得分:-1)
在python中尝试正则表达式
参考
https://docs.python.org/2/library/re.html
import re
s = """
<a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>
"""
src_list = re.findall("src=[^\s]*", s)
输出:
src_list = ['src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"']