I need to get all of the substrings in a given string.
This is what I have so far:
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'
print(find_between(s, '<img src="', '"'))
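However, it only returns the first substring it finds. How can I find all of them?
Given a string containing HTML, I need to take the URL from the src attribute of each img tag, convert those images to base64, and return the modified string with the base64 versions of the images.
This is my function for converting to base64: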
import base64
import requests

def get64(url):
    a = base64.b64encode(requests.get(url).content)
    return 'data:image/jpg;base64,' + a
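As a side note, on Python 3 `base64.b64encode` returns `bytes`, so joining it to a `str` prefix raises a `TypeError`. Below is a minimal sketch of the encoding step alone, assuming the image bytes have already been downloaded. The name `to_data_uri` and its `mime` parameter are made up for illustration and are not part of the code above.

```python
import base64


def to_data_uri(content, mime="image/jpg"):
    """Build a data URI from raw image bytes (hypothetical helper)."""
    # b64encode returns bytes on Python 3, so decode to text before joining.
    encoded = base64.b64encode(content).decode("ascii")
    return "data:{};base64,{}".format(mime, encoded)


print(to_data_uri(b"abc"))  # data:image/jpg;base64,YWJj
```

With `requests`, the argument would be `requests.get(url).content`, as in the function above.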
This is the desired result:
input: s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><p
class="caption">dovv</p> </div> <p> <strong>ccdsfdsf</strong></p> <h3 id="fdsfdsfdsfdsfds">fdsfdsfdsfdsfds</h3>
<pre><code>fdsfdsfdsfdsfdsfds</code></pre> <pre class="sourceCode ruby"><code class="sourceCode ruby">require <span
class="st">'redcarpet'</span> markdown = <span class="dt">Redcarpet</span>.new(<span class="st">"Hello
World!"</span>) puts markdown.to_html</code></pre> <table> <thead> <tr class="header"> <th align="left">Tables</th> <th
align="center">Are</th> <th align="right">Cool</th> </tr> </thead> <tbody> <tr class="odd"> <td align="left">col 3 is</td> <td
align="center">right-aligned</td> <td align="right">$1600</td> </tr> <tr class="even"> <td align="left">col 2 is</td> <td
align="center">centered</td> <td align="right">$12</td> </tr> <tr class="odd"> <td align="left">zebra stripes</td> <td align="center">are
neat</td> <td align="right">$1</td> </tr> </tbody> </table> <div class="figure"> <img
src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'
output: s = '<div class="figure"> <img src="RETURNED_GET64" title="vvv" alt="dovv" /><p
class="caption">dovv</p> </div> <p><strong>ccdsfdsf</strong></p> <h3 id="fdsfdsfdsfdsfds">fdsfdsfdsfdsfds</h3>
<pre><code>fdsfdsfdsfdsfdsfds</code></pre> <pre class="sourceCode ruby"><code class="sourceCode ruby">require <span
class="st">'redcarpet'</span> markdown = <span class="dt">Redcarpet</span>.new(<span class="st">"Hello
World!"</span>) puts markdown.to_html</code></pre> <table> <thead> <tr class="header"> <th align="left">Tables</th> <th
align="center">Are</th> <th align="right">Cool</th> </tr> </thead> <tbody> <tr class="odd"> <td align="left">col 3 is</td> <td
align="center">right-aligned</td> <td align="right">$1600</td> </tr> <tr class="even"> <td align="left">col 2 is</td> <td
align="center">centered</td> <td align="right">$12</td> </tr> <tr class="odd"> <td align="left">zebra stripes</td> <td align="center">are
neat</td> <td align="right">$1</td> </tr> </tbody> </table> <div class="figure"> <img
src="RETURNED_GET64" alt="oi" /><p class="caption">oi</p> </div>'
Answer 0 (score: 0)
You can use a toggle flag that flips each time a double quote (") is found in the string:
s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'
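begin_string = False
list_of_sub_strings = []
sub_string = ""
for letter in s:
    if letter == "\"":
        # every quote flips the flag: on at an opening quote, off at a closing one
        begin_string = not begin_string
        if not begin_string:
            # closing quote: store the finished substring and start over
            list_of_sub_strings.append(sub_string)
            sub_string = ""
    elif begin_string:
        # only non-quote characters inside a quoted region are collected
        sub_string += letter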
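As you can see, each time a quote is found the substring starts collecting letters, and when the next quote is found the substring ends. When a third quote is found, a second substring starts collecting letters, and so on.
To keep only the URL substrings, add an if statement inside the for loop so that only the substrings you want are appended to the list.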
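The filtering step described above could look roughly like this. It is a sketch in Python 3 syntax, and the `startswith("http")` test is only an assumed way of recognising URLs; it would also pick up quoted URLs from attributes other than `src`. The function name `quoted_urls` is hypothetical.

```python
def quoted_urls(s):
    """Collect double-quoted substrings of s that look like URLs."""
    inside_quotes = False
    urls = []
    current = ""
    for letter in s:
        if letter == '"':
            inside_quotes = not inside_quotes
            if not inside_quotes:
                # A closing quote ends the substring; keep it only if it
                # looks like a URL (assumed check).
                if current.startswith("http"):
                    urls.append(current)
                current = ""
        elif inside_quotes:
            current += letter
    return urls


html = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /></div>'
print(quoted_urls(html))  # ['http://127.0.0.1:8000/download_resource/4']
```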
Answer 1 (score: 0)
You can do this with re.findall from the re module.
import re
s = '<div class="figure"> <img src="http://127.0.0.1:8000/download_resource/4" title="vvv" alt="dovv" /><div class="figure"> <img src="http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle="olaaaa" alt="oi" /><p class="caption">oi</p> </div>'
print(re.findall("<img src=\"(.*?)\"", s))
re.findall returns a list of all the text matched between <img src=\" and \" (the parentheses around .*? capture only the text in between, not the image tag itself).
Using s from your example input:
>>> re.findall("<img src=\"(.*?)\"", s)
['http://127.0.0.1:8000/download_resource/4', 'http://www.clickgratis.com.br/fotos-imagens/imagens/aHR0cHM6Ly9zaW1nLm1pbmhhdGVjYS5jb20uYnIvMjliN2RhMDQwMjE0ZmU2NTI2MzBkZjQwMztitle=']
To use this to replace the URLs in the string with base64 images, you can use re.sub:
for image in re.findall("<img src=\"(.*?)\"", s):
    # re.escape keeps characters such as "." and "?" in the URL from being read as regex syntax
    s = re.sub(re.escape(image), get64(image), s)
This replaces every image URL in s with its base64 version (as produced by your function).
The full code is just the snippets above combined.
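Putting the pieces together, a complete version might look like the sketch below (Python 3 syntax). Two things here are assumptions rather than part of the answer: `fake_get64` is a stand-in that does no network access so the example runs anywhere (swap in the real `get64`), and the replacement is done in a single `re.sub` pass with a callback instead of a `findall` loop, which avoids treating the URL itself as a regular expression.

```python
import re

# Three groups: the tag prefix, the URL, and the closing quote.
IMG_SRC = re.compile(r'(<img src=")(.*?)(")')


def fake_get64(url):
    # Stand-in for the question's get64(); returns a marker, no download.
    return "data:image/jpg;base64,ENCODED(" + url + ")"


def embed_images(s, convert=fake_get64):
    """Replace every <img src="..."> URL in s with convert(url)."""
    def swap(match):
        # Keep the surrounding tag text, replace only the captured URL.
        return match.group(1) + convert(match.group(2)) + match.group(3)
    return IMG_SRC.sub(swap, s)


html = '<img src="http://a/1.jpg" alt="x" /><p>hi</p><img src="http://b/2.jpg" />'
print(embed_images(html))
```

Like the pattern in the answer, this only matches tags written exactly as `<img src="...">` with `src` as the first attribute.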
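Answer 2 (score: 0)
My recommendation is the BeautifulSoup library. It looks like you are using Python 2, so you should be able to install it with python2 -m pip install beautifulsoup and start using it right away.
My quick hack at a solution looks like this: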
from BeautifulSoup import BeautifulSoup

with open('whatever_filename.html') as f:
    soup = BeautifulSoup(f.read())

for img in soup.findAll('img'):
    # using your function from above
    img['src'] = get64(img['src'])

with open('whatever_filename.html', 'w') as f:
    f.write(str(soup))
The BeautifulSoup library's home page is here. If you install it the way I suggested above, you will have version 3, which has more documentation here.
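If installing a third-party package is not an option, the standard library's `html.parser` can handle the extraction part of the same parser-based idea. This is a sketch for Python 3 and is not the BeautifulSoup API; the class name `ImgSrcCollector` is made up, and it only collects the `src` values rather than rewriting the document.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


collector = ImgSrcCollector()
collector.feed('<div class="figure"><img src="http://127.0.0.1:8000/download_resource/4" alt="dovv" /></div>')
print(collector.sources)  # ['http://127.0.0.1:8000/download_resource/4']
```

The collected URLs could then be passed to get64 and swapped into the original string.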