Regex: skipping a few words

Date: 2017-05-18 03:43:41

Tags: python html regex

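I'm having trouble matching the text between the quotes of the 'alt' attribute. I've been trying regexes like [!?border="0"] to skip over the border attribute, but it still doesn't work.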

I have also tried \s(border="0")\s(alt=").*?", but it highlights the 'border' attribute as well.

This is the text I'm trying to extract from with the regex:

<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>

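I just want to extract the text between the quotes of the alt attribute. If possible, extracting the title as well would be even better. Please help, thanks.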

5 answers:

Answer 0 (score: 1)

Try this regex:

border=\"0\" alt=\"(.*?)\"

Demo: https://regex101.com/r/1kbiBv/1/

You can also use a positive lookbehind and a positive lookahead so that the match contains only the text between the quotes:

(?<=border=\"0\" alt=\").*?(?=\")

Demo: https://regex101.com/r/1kbiBv/2/
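For reference, here is a minimal sketch of how both patterns might be applied in Python with the standard re module. The sample string is the tag from the question, assuming border="0" sits directly before alt; the variable names are only illustrative.

```python
import re

# Sample tag from the question, assuming border="0" directly precedes alt.
html = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
        'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>')

# Capturing group: the whole match includes border="0" alt="...",
# but group(1) holds only the text between the quotes.
group_match = re.search(r'border="0" alt="(.*?)"', html)

# Lookarounds: the surrounding context is asserted but not consumed,
# so the whole match (group 0) is already just the alt text.
look_match = re.search(r'(?<=border="0" alt=").*?(?=")', html)

alt_from_group = group_match.group(1)
alt_from_lookaround = look_match.group(0)
print(alt_from_group)
print(alt_from_lookaround)
```

The lookbehind works in Python here because the text it asserts has a fixed width; a lookbehind of variable length would be rejected by the re module.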

Answer 1 (score: 0)

A better way to extract HTML elements and attributes is to use BeautifulSoup:

from bs4 import BeautifulSoup
div_test='<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/> '
soup = BeautifulSoup(div_test, "lxml")
result = soup.find("img").get('alt')
result

Output:

'The Durrells: Series 2'
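If installing a third-party package is not an option, a similar result can probably be had with the standard library's html.parser. The sketch below is only one way to do it: the class name ImgAttrParser and the images list are made up for illustration, and it collects the title as well, since the question asks for it.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Collects (alt, title) for every <img> tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags such as <img .../> are routed here by default.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("alt"), attr_map.get("title")))


div_test = ('<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" '
            'border="0" alt="The Durrells: Series 2" '
            'title=" The Durrells: Series 2 " class="photo"/> ')

parser = ImgAttrParser()
parser.feed(div_test)
alt, title = parser.images[0]
print(alt)
print(title.strip())  # the title value carries surrounding spaces
```

A parser-based approach does not care about attribute order or about which attributes sit between src and alt, which is what made the regex attempts in the question fragile.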

Answer 2 (score: 0)

You can use a lambda to build a regex that extracts an attribute's value from your input.

You can try the following code:

import re

a = '''<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>            </a>
'''

find_tag = lambda x: r'{0}="(.*?)"'.format(x)
# Gives the same result here as:
# regex = re.compile(find_tag('border="0" alt'))
regex = re.compile(find_tag("alt"))
text = re.findall(regex, a)
print(text)

Output:

['The Durrells: Series 2']

This code also works for other attributes, for example:

regex = re.compile(find_tag("src"))
# Gives the same result here as:
# regex = re.compile(find_tag('<img src'))
text = re.findall(regex, a)
print(text)

Output:

['http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg']
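One caveat with building the pattern from a bare attribute name: "alt" also matches the tail of a longer name such as "data-alt". A slightly more defensive variant might look like the sketch below. The function find_attr and the shortened sample tag are hypothetical, and it only handles double-quoted values.

```python
import re


def find_attr(name, html):
    """Return all double-quoted values of attribute `name` (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the name;
    # (?<![\w-]) keeps "alt" from matching inside e.g. "data-alt".
    pattern = r'(?<![\w-]){0}="(.*?)"'.format(re.escape(name))
    return re.findall(pattern, html)


# Made-up tag with a data-alt attribute to show the difference.
sample = ('<img data-alt="placeholder" src="a.jpeg" border="0" '
          'alt="The Durrells: Series 2"/>')

naive = re.findall(r'alt="(.*?)"', sample)  # also picks up data-alt
strict = find_attr("alt", sample)
print(naive)
print(strict)
```

On this sample the naive pattern returns both 'placeholder' and the real alt text, while the guarded version returns only the alt text.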

Answer 3 (score: 0)

I think re.search with a simple regex is enough.

import re
s = '<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>'
pat = r'alt="([^"]*)".* title="([^"]*)".*"'
a = re.search(pat, s)
print(a[1]) # value of the alt attribute: Far Cry 3
print(a[2]) # value of the title attribute, including its surrounding spaces
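Note that re.search returns None when the tag has no alt followed by a title, so indexing the result directly would raise a TypeError. A guarded variant could look like the following sketch; the function name alt_and_title and the named groups are illustrative, and like the pattern above it assumes alt appears before title.

```python
import re

# Named groups make the two captures self-describing.
# Assumption: alt comes before title, as in the sample tag.
PAT = re.compile(r'alt="(?P<alt>[^"]*)".*? title="(?P<title>[^"]*)"')


def alt_and_title(tag):
    """Return (alt, title) with the title stripped, or None if there is no match."""
    m = PAT.search(tag)
    if m is None:
        return None
    return m.group("alt"), m.group("title").strip()


s = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
     'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>')

found = alt_and_title(s)
missing = alt_and_title('<img src="x.png"/>')
print(found)
print(missing)
```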

Answer 4 (score: 0)

This code finds what you need using the following pattern: 'alt=".*?"'

import re

w = '<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>   </a>'

pattern = 'alt=".*?"'
m = re.findall(pattern, w)
print(m)  # each match still includes the leading alt= and the quotes
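Without a capturing group, re.findall returns the whole match, so each result still starts with alt= and keeps its quotes. If only the text between the quotes is wanted, adding a group is probably the smallest change, as in this sketch (variable names are illustrative):

```python
import re

w = ('<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" '
     'border="0" alt="The Durrells: Series 2" '
     'title=" The Durrells: Series 2 " class="photo"/>   </a>')

# No group: findall returns the full matched text.
whole = re.findall(r'alt=".*?"', w)

# One group: findall returns only what the group captured.
inner = re.findall(r'alt="(.*?)"', w)

print(whole)
print(inner)
```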