I have some strings in a file, like these:
line1 <img alt="Powered by MediaWiki" height="31" src="/static/images/poweredby_mediawiki_88x31.png" srcset="/static/images/poweredby_mediawiki_132x47.png 1.5x, /static/images/poweredby_mediawiki_176x62.png 2x" width="88"/>
line2 '<img alt="" class="wp-image-141 size-large" height="591" sizes="(max-width: 788px) 100vw, 788px" src="https://alessandrorossini.org/wp-content/2018/07/20180619_151349-1024x768.jpg" srcset="https://alessandrorossini.org/wp-content/2018/07/20180619_151349-1024x768.jpg 1024w, https://alessandrorossini.org/wp-content/2018/07/20180619_151349-300x225.jpg 300w, https://alessandrorossini.org/wp-content/2018/07/20180619_151349-788x591.jpg 788w" width="788"/>
I want to read the height value from each line (for example, 31 for line 1 and 591 for line 2).
How can I do that?
Answer 0 (score: 1)
To make the code below runnable, I put your two lines into a file named file_name.html. Here are two ways to extract the value of height.
With BeautifulSoup
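from bs4 import BeautifulSoup

# The 'html5lib' parser is a separate package and must be installed as well
with open('file_name.html', 'r') as f:
    soup = BeautifulSoup(f, 'html5lib')

# Print the height attribute of every <img> tag (None if the tag has none)
for img_tag in soup.find_all('img'):
    print(img_tag.get('height'))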
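If installing third-party packages is not an option, the standard library's html.parser module can collect the same attribute. This is only a sketch of an alternative, not part of the approach above: the class name ImgHeightParser is hypothetical, and the two input lines are shortened versions of the ones in the question.

```python
from html.parser import HTMLParser


class ImgHeightParser(HTMLParser):
    """Collects the height attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.heights = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased. Self-closing tags such as <img .../>
        # also end up here via the default handle_startendtag.
        if tag == 'img':
            self.heights.append(dict(attrs).get('height'))


# Shortened sample input (assumption: same shape as the lines in the question)
parser = ImgHeightParser()
parser.feed('line1 <img alt="a" height="31" width="88"/>\n'
            'line2 <img alt="" height="591" width="788"/>\n')
parser.close()
print(parser.heights)
```

The values come back as strings, and a tag without a height attribute contributes None, which mirrors what img_tag.get('height') returns.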
Using regular expressions
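import re

with open('file_name.html', 'r') as f:
    lines = f.readlines()

# Raw string so \d is not treated as a string escape;
# the 2nd regex group captures the value of height
regex = r'(height=")(\d*)(")'
# Skip lines without a match so .group() is never called on None
matches = (re.search(regex, l) for l in lines)
heights = [m.group(2) for m in matches if m]
print(heights)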
Note that this particular regex example only captures the first height value on each line.
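If a line may contain more than one img tag, re.findall returns every match instead of only the first. A minimal sketch, assuming the attribute is always written as height="digits" with double quotes; the helper name extract_heights and the sample string are made up for illustration:

```python
import re

# Match height="<digits>" only when "height" begins an attribute name,
# so names such as data-height are not picked up by accident.
HEIGHT_RE = re.compile(r'(?<![\w-])height="(\d+)"')


def extract_heights(line):
    """Return every height value found in one line, as integers."""
    return [int(value) for value in HEIGHT_RE.findall(line)]


# Hypothetical line with two img tags and one look-alike attribute
sample = '<img height="31" width="88"/> <img data-height="5" height="591"/>'
print(extract_heights(sample))
```

A regex still does not understand HTML (single-quoted or unquoted attribute values would be missed), so for anything beyond simple, uniform input the parser-based approach is the safer choice.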