Question

I have some strings in a file, like this:

line1    <img alt="Powered by MediaWiki" height="31" src="/static/images/poweredby_mediawiki_88x31.png" srcset="/static/images/poweredby_mediawiki_132x47.png 1.5x, /static/images/poweredby_mediawiki_176x62.png 2x" width="88"/>
line2    '<img alt="" class="wp-image-141 size-large" height="591" sizes="(max-width: 788px) 100vw, 788px" src="https://alessandrorossini.org/wp-content/2018/07/20180619_151349-1024x768.jpg" srcset="https://alessandrorossini.org/wp-content/2018/07/20180619_151349-1024x768.jpg 1024w, https://alessandrorossini.org/wp-content/2018/07/20180619_151349-300x225.jpg 300w, https://alessandrorossini.org/wp-content/2018/07/20180619_151349-788x591.jpg 788w" width="788"/>

I want to read the height value from each line (for example, 31 for line 1 and 591 for line 2).

How can I do this?

Answer 1

To run the code below, I put your two lines into a file named file_name.html. Here are two ways to extract the value of height.

With BeautifulSoup

from bs4 import BeautifulSoup

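# The 'html5lib' parser is a separate package and must be installed alongside bs4.
with open('file_name.html', 'r') as f:
    soup = BeautifulSoup(f, 'html5lib')
    # Each <img> tag exposes its attributes; .get() returns None if height is absent.
    for img_tag in soup.find_all('img'):
        print(img_tag.get('height'))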
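
If installing BeautifulSoup and html5lib is not an option, roughly the same result can be had with the standard library's html.parser module. This is only a sketch: the class name ImgHeightParser, the helper img_heights, and the shortened sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class ImgHeightParser(HTMLParser):
    """Collects the height attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.heights = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == "img":
            height = dict(attrs).get("height")
            if height is not None:
                self.heights.append(height)


def img_heights(html_text):
    """Return the height values (as strings) of all <img> tags in html_text."""
    parser = ImgHeightParser()
    parser.feed(html_text)
    parser.close()
    return parser.heights


# Shortened stand-ins for the two lines in the question.
sample = (
    '<img alt="a" height="31" width="88"/>\n'
    '<img alt="" height="591" width="788"/>'
)
print(img_heights(sample))
```

The values come back as strings, just like with BeautifulSoup, so wrap them in int() if you need numbers.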

With a regular expression

import re

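with open('file_name.html', 'r') as f:
    lines = f.readlines()

# Raw string so \d is not treated as a string escape; group 1 captures the value of height.
regex = r'height="(\d+)"'
matches = (re.search(regex, line) for line in lines)
# Skip lines without a height attribute (re.search returns None for them).
heights = [m.group(1) for m in matches if m]
print(heights)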

Note that this particular regex example only captures the first height value on each line.
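
If a line may contain more than one img tag, re.findall can be used in place of re.search to collect every height value on the line. The following is a minimal sketch under that assumption; the helper name heights_per_line and the sample lines are invented for illustration.

```python
import re

# Captures only the digits inside height="...".
HEIGHT_RE = re.compile(r'height="(\d+)"')


def heights_per_line(lines):
    """Return one list of height strings per input line (empty if none found)."""
    return [HEIGHT_RE.findall(line) for line in lines]


# Made-up sample lines: one match, two matches, and no match.
sample = [
    '<img height="31" width="88"/>',
    '<img height="591" width="788"/> <img height="12"/>',
    'no image here',
]
print(heights_per_line(sample))
```

Because findall returns an empty list when nothing matches, lines without a height attribute need no special handling.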
