Question

I need to get all the substrings in a given string that lie between two markers:

Currently I have this:

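def find_between(s, first, last):
    try:
        # Start right after the first marker and stop at the next occurrence of the last one.
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        # index() raises ValueError when either marker is missing.
        return ""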

s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'

print find_between( s, '<img src="', '"')

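However, it only returns the first substring found. How can I find all of them?

Given a string containing HTML, I need to get the URLs from the src attribute of the img tags, convert those images to base64, and then return the modified string with the base64 versions of the images.

This is my function for converting to base64: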

import base64
import requests

def get64(url):
    a = base64.b64encode(requests.get(url).content)
    return 'data:image/jpg;base64,' + a

This is the desired result:

 input: s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><p
class="caption">dovv</p> </div> <p> <strong>ccdsfdsf</strong></p> <h3 id="fdsfdsfdsfdsfds">fdsfdsfdsfdsfds</h3>
<pre><code>fdsfdsfdsfdsfdsfds</code></pre> <pre class="sourceCode ruby"><code class="sourceCode ruby">require <span
class="st">&#39;redcarpet&#39;</span> markdown = <span class="dt">Redcarpet</span>.new(<span class="st">&quot;Hello
World!&quot;</span>) puts markdown.to_html</code></pre> <table> <thead> <tr class="header"> <th align="left">Tables</th> <th
align="center">Are</th> <th align="right">Cool</th> </tr> </thead> <tbody> <tr class="odd"> <td align="left">col 3 is</td> <td
align="center">right-aligned</td> <td align="right">$1600</td> </tr> <tr class="even"> <td align="left">col 2 is</td> <td
align="center">centered</td> <td align="right">$12</td> </tr> <tr class="odd"> <td align="left">zebra stripes</td> <td align="center">are
neat</td> <td align="right">$1</td> </tr> </tbody> </table> <div class="figure"> <img
src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'

    output: s = '<div class="figure"> <img src="RETURNED_GET64" title="vvv" alt="dovv" /><p
class="caption">dovv</p> </div> <p><strong>ccdsfdsf</strong></p> <h3 id="fdsfdsfdsfdsfds">fdsfdsfdsfdsfds</h3>
<pre><code>fdsfdsfdsfdsfdsfds</code></pre> <pre class="sourceCode ruby"><code class="sourceCode ruby">require <span
class="st">&#39;redcarpet&#39;</span> markdown = <span class="dt">Redcarpet</span>.new(<span class="st">&quot;Hello
World!&quot;</span>) puts markdown.to_html</code></pre> <table> <thead> <tr class="header"> <th align="left">Tables</th> <th
align="center">Are</th> <th align="right">Cool</th> </tr> </thead> <tbody> <tr class="odd"> <td align="left">col 3 is</td> <td
align="center">right-aligned</td> <td align="right">$1600</td> </tr> <tr class="even"> <td align="left">col 2 is</td> <td
align="center">centered</td> <td align="right">$12</td> </tr> <tr class="odd"> <td align="left">zebra stripes</td> <td align="center">are
neat</td> <td align="right">$1</td> </tr> </tbody> </table> <div class="figure"> <img
src="RETURNED_GET64" alt="oi" /><p class="caption">oi</p> </div>'

Answer 1

You can use a toggle flag that switches on and off each time a double quote (") is found in the string:

s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'

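begin_string = False      # True while we are between an opening and a closing quote
list_of_sub_strings = []
sub_string = ""

for letter in s:
    if letter == "\"":
        # Every quote flips the flag: on at an opening quote, off at a closing one.
        begin_string = not begin_string
        if not begin_string:
            # Closing quote: store the finished substring and start a new one.
            list_of_sub_strings.append(sub_string)
            sub_string = ""
    elif begin_string:
        # Ordinary character inside quotes (elif keeps the quote itself out).
        sub_string += letter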

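As you can see, each time a quote is found the substring starts collecting letters, and when the next quote is found the substring ends. If a third quote is found, a second substring starts collecting letters, and so on.

Now you can keep only the URL substrings by adding an if statement inside the for loop, so that only the substrings you want are appended to the list.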
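
A minimal sketch of that filtering step, written for Python 3. The helper name quoted_urls and the rule that a URL is any quoted value starting with http:// or https:// are assumptions made for illustration, not part of the answer above.

```python
def quoted_urls(s):
    """Collect double-quoted substrings of s that look like URLs."""
    inside_quotes = False   # toggled each time a double quote is seen
    current = ""
    urls = []
    for letter in s:
        if letter == '"':
            inside_quotes = not inside_quotes
            if not inside_quotes:
                # Closing quote: keep the substring only if it looks like a URL
                # (assumed rule: it starts with http:// or https://).
                if current.startswith(("http://", "https://")):
                    urls.append(current)
                current = ""
        elif inside_quotes:
            current += letter
    return urls


html = '<img src="http://example.com/a.png" title="first" /><img src="https://example.com/b.png" alt="second" />'
print(quoted_urls(html))
```

Note that this rule would also pick up URLs in other attributes, such as href, so it only fits input where img tags are the sole source of quoted URLs.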

Answer 2

You can do this with the re module, using re.findall:

import re

s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'

print re.findall("<img src=\"(.*?)\"", s)

This returns a list of all the text matched between <img src=" and the next " (the parentheses around .*? capture only the text in between, not the img tag itself).

With your example input:

>>> re.findall("<img src=\"(.*?)\"", s)
['http://127.0.0.1:8000/download_resource/4', 'http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle=']

To replace these URLs in the string with the base64 images, you can use re.sub:

for image in re.findall("<img src=\"(.*?)\"", s):
    # re.escape keeps characters such as . and ? in the URL from being read as regex syntax.
    s = re.sub(re.escape(image), get64(image), s)

This replaces all the images in s with their base64 versions (as generated by your function).

So the full code would look like this:
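
One way to put these pieces together is sketched below for Python 3. It is a sketch under assumptions, not the answer's own listing: the names replace_image_sources and fake_get64 are hypothetical, and fake_get64 is a stand-in that encodes the URL text so the example runs without a network connection. In real use the question's get64 would be passed in instead. The sketch also swaps the findall loop for a single re.sub call with a replacement function.

```python
import base64
import re

IMG_SRC = re.compile(r'<img src="(.*?)"')


def replace_image_sources(html, convert):
    """Return html with every <img src="..."> URL replaced by convert(url)."""
    def swap(match):
        # match.group(1) is just the URL captured by the parentheses.
        return '<img src="' + convert(match.group(1)) + '"'
    # Passing a function to re.sub means the URL is never used as a pattern,
    # so characters such as . or ? in it need no escaping.
    return IMG_SRC.sub(swap, html)


def fake_get64(url):
    # Stand-in for get64 (assumption): encodes the URL text itself
    # instead of downloading the image.
    encoded = base64.b64encode(url.encode('utf-8')).decode('ascii')
    return 'data:image/jpg;base64,' + encoded


s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /></div>'
print(replace_image_sources(s, fake_get64))
```

Because the pattern expects exactly one space between img and src, a tag that is wrapped across lines, or one where src is not the first attribute, would be skipped.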

Answer 3

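My recommendation is the BeautifulSoup library. It looks like you are using Python 2, so you should be able to install it with python2 -m pip install beautifulsoup and start using it right away.

My quick hack at a solution looks like this: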

from BeautifulSoup import BeautifulSoup

with open('whatever_filename.html') as f:
    soup = BeautifulSoup(f.read())

for img in soup.findAll('img'):
    # using your function from above
    img['src'] = get64(img['src'])

with open('whatever_filename.html', 'w') as f:
    f.write(str(soup))

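See the BeautifulSoup library homepage for details. If you install it the way I suggested above, you will have version 3, which has its own, more extensive documentation.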
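
If installing a third-party library is not an option, a parser-based approach can also be sketched with the html.parser module from Python 3's standard library. This is a different technique from the BeautifulSoup code above, swapped in only for illustration; ImgSrcCollector and img_sources are hypothetical names, and the sketch only collects the URLs rather than rewriting the document.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


def img_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


html = '<div class="figure"> <img\nsrc="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /></div>'
print(img_sources(html))
```

Unlike a plain substring or regex search, a parser tolerates a line break between img and src, as in the wrapped input shown in the question.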

Get all substrings in a string in Python
