Question

I need to find and extract image sources from an HTML file. For example, it might contain:

<image class="logo" src="http://example.site/logo.jpg">

or

<img src="http://another.example/picture.png">

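I am using Python and do not want to use any third-party packages. I can use the re module, though. The program should:

Go through all of the content
Find the img or image tags
Find src and get the attribute value (without the double quotes)

Is this possible, and if so, how should I do it? We can assume I do not need internet access for this (I have a file called website.html that contains all of the HTML code).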

Edit: my current regular expressions are

r'<img[^>]*\ssrc="(.*?)"'

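and

r'<image[^>]*\ssrc="(.*?)"'.

The main problem is that these expressions pick up anything that starts with img or image. For example, given something like <imagesomethingrandom src="website">, it is still treated as an image (because the word image is at the start) and its source is added.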
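The problem described above can be reproduced in a few lines. This is only a minimal sketch: the `html` sample string is made up for illustration, and the pattern is the second expression quoted above.

```python
import re

# Hypothetical sample input: one real tag and one look-alike tag.
html = (
    '<image class="logo" src="http://example.site/logo.jpg">\n'
    '<imagesomethingrandom src="website">'
)

pattern = r'<image[^>]*\ssrc="(.*?)"'

# [^>]* also consumes "somethingrandom", so the look-alike tag is matched too.
sources = re.findall(pattern, html)
print(sources)
```

Nothing in the pattern forces the tag name to end after `image`, which is why the second, unwanted source shows up in the result.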

Thanks in advance.

Rob.

Answer 1

Modified version

<ima?ge? # optional letters let one expression match both tags
\s+      # require at least one whitespace character (newlines count and are valid),
         # which prevents matching tags like <imgbutnotreally>
[^>]*?   # skip any other attributes, non-greedily (performance)
\bsrc="([^"]+) # word boundary before src, then capture everything up to the closing quote

rubular

<ima?ge?\s+[^>]*?\bsrc="([^"]+)
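As a rough sketch of how this pattern might be used from Python, the commented form (with the word boundary `\bsrc`) can be compiled with `re.VERBOSE`. The names `IMG_SRC` and `sample` are made up for illustration, and the sample text stands in for the contents of website.html.

```python
import re

# Commented form of the pattern; re.VERBOSE ignores the whitespace and comments.
IMG_SRC = re.compile(r'''
    <ima?ge?        # tag name: img or image
    \s+             # at least one whitespace character after the tag name
    [^>]*?          # any other attributes, non-greedy
    \bsrc="([^"]+)  # capture the double-quoted src value
''', re.VERBOSE)

# Hypothetical stand-in for the text read from website.html.
sample = (
    '<image class="logo" src="http://example.site/logo.jpg">\n'
    '<img src="http://another.example/picture.png">\n'
    '<imagesomethingrandom src="website">\n'
)

sources = IMG_SRC.findall(sample)
print(sources)
```

Because `\s+` must follow the tag name directly, `<imagesomethingrandom ...>` is skipped. Note that this version only handles double-quoted `src` values.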

Answer 2

Try BeautifulSoup; just write

from bs4 import BeautifulSoup
# name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup(theHTMLtext, 'html.parser')
imagesElements = soup.find_all('img')
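Since the question rules out third-party packages, a parser-based approach in the same spirit might also be possible with the standard library's `html.parser` module. The following is a minimal sketch, not part of the original answer; the class name `ImageSourceParser` and the `sample` string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ImageSourceParser(HTMLParser):
    """Collects the src attribute of every <img> and <image> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag in ('img', 'image'):
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Hypothetical stand-in for the contents of website.html.
sample = (
    '<image class="logo" src="http://example.site/logo.jpg">'
    '<imagesomethingrandom src="website">'
    "<img src='http://another.example/picture.png'>"
)

parser = ImageSourceParser()
parser.feed(sample)
print(parser.sources)
```

The tag name is compared as a whole, so `imagesomethingrandom` is ignored without any special handling.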

Answer 3

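Description

This expression will:

Find all img and image tags that have a src attribute
Ignore tags that are not image or img, such as imagesomethingrandom
Capture the value of the src attribute
Correctly handle single-quoted, double-quoted, or unquoted attribute values
Avoid most of the tricky edge cases that come up when matching HTML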

<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Examples

Live Regex Demo
Live Python Demo

Sample text

Note the rather difficult edge case in the first line

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>

Python code

#!/usr/bin/python3
import re

string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

# enumerate() numbers the matches; re.I ignores case and re.S lets . match newlines
for intCount, matchObj in enumerate(re.finditer(regex, string, re.M | re.I | re.S)):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))

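Capture groups

Group 0 gets the entire image or img tag
Group 1 gets the quote surrounding the src attribute, if there is one
Group 2 gets the src attribute value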

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
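To get just the list of sources the question asks for, the same expression could be wrapped in a small helper that keeps only group 2. This is a sketch under assumptions: the function name `find_image_sources` and the shortened `sample` text are illustrative and not part of the answer above.

```python
import re

# Same expression as above, compiled once; group 2 holds the src value.
IMG_SRC = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.I | re.S,
)


def find_image_sources(html):
    """Return the src value of every img/image tag found in html."""
    return [match.group(2) for match in IMG_SRC.finditer(html)]


# Shortened version of the sample text, including the difficult first line.
sample = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image src='http://another.example/SingleQuoted.png'>
"""

sources = find_image_sources(sample)
print(sources)
```

For the file from the question, the text passed in would be the contents of website.html, read with `open()`. Which encoding to pass depends on how that file was saved.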

Python 3.3.2 - Finding image sources in HTML