如何使用正则表达式提取特定的img src url格式?

时间:2016-06-08 12:36:51

标签: python regex url extract src

我的字符串:

Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|

我想把这3个链接放到一个列表中:

http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8

他们遵守这种模式:

src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"

我知道我应该使用re.findall(pattern, string)来实现这一目标。

但最重要的问题是:如何构建适用于此处的模式?

我在编写正则表达式方面并不擅长......我总是感到困惑......几乎完成工作的是这一个:

pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

但我得到的只是这个清单:

[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']

看起来问题出在~r部分以及之后的事情上。

4 个答案:

答案 0 :(得分:2)

这些数据来自哪里?我建议使用html解析器而不是尝试使用正则表达式提取。如果来自html

,您可以从标签中提取完整值

下面我把你的字符串放在test.html中(对于python,使用beautifulsoup作为例子)

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'))
>>> [x['src'] for x in soup.findAll('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']

答案 1 :(得分:0)

您遗漏了正则表达式中的~字符:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

btw:this是在Python中测试正则表达式的超级方法!

答案 2 :(得分:0)

试试这个脚本:

text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
print re.findall(r'(https?://\S+)', text1)

,结果是

['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"',   'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8"']

答案 3 :(得分:0)

试试这个:

(?:src=)(".*?")

并获得group \ 1

DEMO