Python regex on a list of strings

Time: 2016-04-29 19:07:27

Tags: python regex python-2.7

I am trying to extract URLs from a list of strings. Sample list:

import re
p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']

I want to extract the http://www.sample.com/test.jpg part that follows src=" in each string.

If p were just a single string, I could use findall:

t = re.findall('src="(.+)" alt', p)
print t

But how do I iterate over the list and return a list of all the URLs in p?

5 answers:

Answer 0: (score: 0)

Does this do what you want?

import re
p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']
outList = [re.findall('src="(.+)" alt', pp)[0] for pp in p]
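One caveat with indexing `[0]` on the result of `findall`: a string without a match raises `IndexError`. A minimal sketch of a more forgiving variant follows, assuming you would rather skip such entries. The shortened sample tags and the names `tags`, `pattern`, and `urls` are made up for illustration.

```python
import re

# Hypothetical sample data: the second entry has no src attribute at all.
tags = [
    '<img class="a" src="http://www.sample.com/test.jpg" alt="x" />',
    '<img class="b" alt="no source here" />',
]

# Same pattern as in the question, compiled once and reused.
pattern = re.compile(r'src="(.+)" alt')

urls = []
for tag in tags:
    match = pattern.search(tag)
    if match is not None:  # skip entries without a match instead of raising
        urls.append(match.group(1))

print(urls)
```

Whether silently skipping is the right behaviour depends on your data; raising an error may be preferable if every string is expected to contain a src.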

Answer 1: (score: 0)

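Use a list comprehension:

# plain string methods, no regex; one possible way to write it
nested = [[part.split('"')[1] for part in s.split() if part.startswith('src=')] for s in p]

This gives you a list of lists, which can be chained together:

from itertools import chain
urls = list(chain.from_iterable(nested))

The regex answers elsewhere are more elegant.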

Answer 2: (score: 0)

How about doing it in a loop:

>>> pe = re.compile('src="(.+)" alt')
>>> for img in p:
...     print pe.findall(img)
... 
['http://www.sample.com/test.jpg']
['http://www.sample.com/test2.jpg']
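The loop above prints each result, while the question asks for a single list. A rough sketch of the same loop collecting into one flat list is shown below. The function name `extract_src_urls` and the shortened sample tags are assumptions for illustration, not part of the original answer.

```python
import re

# Same pattern as the loop above, compiled once.
SRC_PATTERN = re.compile(r'src="(.+)" alt')


def extract_src_urls(tags):
    """Return one flat list with every src URL found in the given strings."""
    urls = []
    for tag in tags:
        # findall returns a list per string; extend keeps the result flat
        urls.extend(SRC_PATTERN.findall(tag))
    return urls


# Hypothetical, shortened versions of the strings from the question.
sample = [
    '<img class="x" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" />',
    '<img class="y" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" />',
]
urls = extract_src_urls(sample)
print(urls)
```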

Answer 3: (score: 0)

for i in p:
    t = re.findall('src="(.+)" alt', i)
    print t

Update

k=[re.findall('src="(.+)" alt',i) for i in p]
[item for sublist in k for item in sublist]

['http://www.sample.com/test.jpg','http://www.sample.com/test2.jpg']
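Note that `.+` is greedy. It works here because each string holds a single tag, but it can overshoot once one string contains several tags. The sketch below compares greedy and non-greedy matching on a made-up string with two tags; the variable names and the short file names are illustrative only.

```python
import re

# Hypothetical input: two img tags in the same string.
html = '<img src="a.jpg" alt="1" /><img src="b.jpg" alt="2" />'

# Greedy: .+ runs to the LAST '" alt' in the string, swallowing both tags.
greedy = re.findall(r'src="(.+)" alt', html)

# Non-greedy: .+? stops at the FIRST '" alt' after each src.
lazy = re.findall(r'src="(.+?)" alt', html)

print(greedy)
print(lazy)
```

With this input the greedy pattern returns a single over-long match, while the non-greedy one returns the two file names separately.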

Answer 4: (score: 0)

Here is a solution using BeautifulSoup:

>>> p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(''.join(p), 'html.parser')
>>> src_links = [img['src'] for img in soup.find_all('img')]

>>> src_links
[u'http://www.sample.com/test.jpg', u'http://www.sample.com/test2.jpg']
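If installing BeautifulSoup is not an option, the standard library's `HTMLParser` can do similar work. This is only a sketch, written for Python 3 (in Python 2.7 the class lives in the `HTMLParser` module instead). The class name `ImgSrcCollector` and the shortened sample tags are assumptions for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img ... /> are also routed here by the default handle_startendtag
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical, shortened versions of the strings from the question.
tags = [
    '<img class="alignnone" src="http://www.sample.com/test.jpg" alt="0wCR41v" />',
    '<img class="alignnone" src="http://www.sample.com/test2.jpg" alt="NtAboHF" />',
]

collector = ImgSrcCollector()
for fragment in tags:
    collector.feed(fragment)
collector.close()

print(collector.sources)
```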

If you really want to use a regex:

>>> regex = re.compile(r'src="(.+)" alt')
>>> [regex.search(img).group(1) for img in p]
['http://www.sample.com/test.jpg', 'http://www.sample.com/test2.jpg']
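The pattern above relies on `alt` coming directly after `src`. A possible variation, assuming the attribute values are always double-quoted, is to match up to the closing quote instead. In this sketch the `\b` keeps the pattern anchored at the start of the attribute name; the sample tags (one with `alt` before `src`) and the names `src_pattern` and `urls` are made up for illustration.

```python
import re

# Hypothetical tags: the second one places alt BEFORE src.
tags = [
    '<img src="http://www.sample.com/test.jpg" alt="0wCR41v" '
    'srcset="http://www.sample.com/test-225x300.jpg 225w" />',
    '<img alt="NtAboHF" width="583" src="http://www.sample.com/test2.jpg" />',
]

# [^"]+ stops at the closing quote, so attribute order no longer matters;
# 'srcset="' is not matched because 'src' is not directly followed by '="'.
src_pattern = re.compile(r'\bsrc="([^"]+)"')

# search returns None when there is no src, so such entries are filtered out
urls = [m.group(1) for m in map(src_pattern.search, tags) if m]

print(urls)
```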