Question

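I'm having trouble capturing the text between the quotes of the 'alt' attribute. I've been trying regexes like [!?border="0"] to skip over it, but it still doesn't work.

I have tried \s(border="0")\s(alt=").*?", but the match also includes the 'border' attribute.

This is the text I'm trying to extract from with a regex: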

<img src="http://www.ebgames.com.au/0141/169/5.png"alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>

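I just want to extract the text between the quotes of the alt attribute. If possible, extracting the title as well would be even better. Please help, thanks.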

Answer 1

Try this regex:

border=\"0\" alt=\"(.*?)\"

Demo: https://regex101.com/r/1kbiBv/1/

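You can also use a positive lookbehind and a positive lookahead so that the match contains only the content between the quotes: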

(?<=border=\"0\" alt=\").*?(?=\")

Demo: https://regex101.com/r/1kbiBv/2/
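
As a rough sketch of how the two patterns above behave in Python: the variable names are illustrative, and the sample tag is assumed to contain border="0" directly before alt, since both patterns require that prefix.

```python
import re

# Sample tag; assumes border="0" sits directly before alt, as both patterns require.
html = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
        'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>')

# Pattern 1: the prefix is part of the match, the alt value is captured in group 1.
with_group = re.search(r'border="0" alt="(.*?)"', html)
alt_from_group = with_group.group(1) if with_group else None

# Pattern 2: lookbehind and lookahead only assert their context,
# so the whole match (group 0) is just the alt value.
with_lookaround = re.search(r'(?<=border="0" alt=").*?(?=")', html)
alt_from_lookaround = with_lookaround.group(0) if with_lookaround else None

print(alt_from_group)
print(alt_from_lookaround)
```

Both variants yield the same text here. The capture-group form is usually easier to adapt, because Python's re module only accepts fixed-width lookbehinds.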

Answer 2

A better way to extract HTML elements and attributes is to use BeautifulSoup:

from bs4 import BeautifulSoup
div_test='<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/> '
soup = BeautifulSoup(div_test, "lxml")
result = soup.find("img").get('alt')
result

Output:

'The Durrells: Series 2'
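
If installing BeautifulSoup and lxml is not an option, a similar result can probably be had with the standard library's html.parser. This is a swapped-in technique rather than what this answer uses, and the ImgAttrCollector class is a hypothetical helper written for illustration:

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags such as <img .../> also reach this method by default.
        if tag == "img":
            self.images.append(dict(attrs))


markup = ('<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" '
          'border="0" alt="The Durrells: Series 2" '
          'title=" The Durrells: Series 2 " class="photo"/> ')

parser = ImgAttrCollector()
parser.feed(markup)
parser.close()

first = parser.images[0]
alt_text = first.get("alt")
title_text = first.get("title", "").strip()  # the title value is padded with spaces
print(alt_text)
print(title_text)
```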

Answer 3

You can use a lambda to build a pattern that extracts a given attribute from the input.

You can try the following code:

import re

a = '''<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>            </a>
'''

find_tag = lambda x: r'{0}="(.*?)"'.format(x)
# Same as doing:
# regex = re.compile(find_tag('border="0" alt'))
regex = re.compile(find_tag("alt"))
text = re.findall(regex, a)
print(text)

Output:

['The Durrells: Series 2']

This code also works for other attributes, for example:

regex = re.compile(find_tag("src"))
# Same as doing:
# regex = re.compile(find_tag('<img src'))
text = re.findall(regex, a)
print(text)

Output:

['http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg']

Answer 4

I think a simple regex with re.search is enough.

import re
s = '<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>'
pat = r'alt="([^"]*)".* title="([^"]*)".*"'
a = re.search(pat, s)
print(a[1])  # value of the alt attribute: 'Far Cry 3'
print(a[2])  # value of the title attribute: ' Far Cry 3 ' (surrounding spaces included)
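
The pattern above depends on alt appearing before title. As a hedged alternative sketch, not part of this answer, every name="value" pair can be collected into a dict so the attribute order no longer matters. It assumes double-quoted values with no embedded quotes, and the variable names are illustrative:

```python
import re

s = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
     'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>')

# Each match is a (name, value) tuple; dict() turns the list of pairs into a lookup.
# Assumption: values are double-quoted and contain no double quotes themselves.
attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', s))

alt_text = attrs.get("alt")
title_text = attrs.get("title", "").strip()  # drop the padding around the title
print(alt_text)
print(title_text)
```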

Answer 5

This code finds what you need using the following pattern: 'alt=".*?"'.

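import re

w = '<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>   </a>'

# Non-greedy match from alt=" up to the next double quote
pattern = 'alt=".*?"'
m = re.findall(pattern, w)
print(m)  # each match includes the alt="..." wrapper, not just the value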

Regex to skip a few words
