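How to iterate over a list and extract the text between quotation marks using Python 2.7.10

Time: 2018-02-14 13:03:24

Tags: python regex list text-parsing

I'm trying to iterate over a long list (let's call it url_list) in which each item looks like this: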

<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>, <a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>, <a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>, <a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>,

and so on. I want to iterate over the list and keep only the text between the first two quotation marks, discarding the rest, i.e.:

https://www.example.com/5th-february-2018/, https://www.example.com/4th-february-2018/, https://www.example.com/3rd-february-2018/, https://www.example.com/2nd-february-2018/,

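So basically I'm trying to return a nice, clean list of URLs. I haven't had much luck iterating over the list and splitting on the quotation marks. Is there a better way to do this? Is there a way to discard everything after the itemprop= string?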

3 answers:

Answer 0 (score: 1)

Use a regular expression:

import re

url_list = ['<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>', '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>']
for i in url_list:
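    # [^\s]+ is greedy and backtracks to the last "/" of the URL,
    # so the captured group ends just before it (trailing slash dropped).
    print(re.search(r"(?P<url>https?://[^\s]+)/", i).group("url"))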

Output:

https://www.example.com/5th-february-2018
https://www.example.com/4th-february-2018
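
Note that the pattern above stops before the final slash, so the trailing `/` shown in the question's desired output is lost. A minimal sketch that captures the whole `href` value instead, assuming every item contains a double-quoted `href` attribute (the names `HREF_RE` and `extract_href` are hypothetical, introduced only for illustration):

```python
import re

# Assumption: each item holds an <a> tag with a double-quoted href attribute.
HREF_RE = re.compile(r'href="([^"]*)"')


def extract_href(item):
    """Return the first href value found in item, or None if there is none."""
    match = HREF_RE.search(item)
    return match.group(1) if match else None


url_list = [
    '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>',
    '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>',
]

clean_urls = [extract_href(item) for item in url_list]
print(clean_urls)
```

Returning None for items without a match avoids the AttributeError that `re.search(...).group(...)` raises when nothing matches.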

Answer 1 (score: 0)

Have you tried using the split function to split on " and then taking the second entry from the resulting list?

urls=[]
for url_entry in url_list:
    url = url_entry.split('\"')[1]
    urls.append(url)
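
The same idea fits in a list comprehension. This is only a sketch: it assumes every entry is a string with the URL inside its first pair of double quotes, and it skips entries with fewer than two quotes instead of raising IndexError. The sample data comes from the question, plus one hypothetical malformed item.

```python
url_list = [
    '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>',
    '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>',
    'no quotes in this entry',  # hypothetical malformed item, for illustration
]

# split('"', 2) performs at most two splits:
# [text before the first quote, first quoted text, everything after]
urls = [entry.split('"', 2)[1] for entry in url_list if entry.count('"') >= 2]
print(urls)
```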

Answer 2 (score: 0)

This sounds a bit like an XY problem.

If you are (or were) using BeautifulSoup to parse the HTML, this becomes much easier:

from bs4 import BeautifulSoup

html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>'''

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(html_text, "html.parser")
urls = [x['href'] for x in soup.find_all("a")]
for url in urls:
    print(url)
# https://www.example.com/5th-february-2018/
# https://www.example.com/4th-february-2018/
# https://www.example.com/3rd-february-2018/
# https://www.example.com/2nd-february-2018/
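
If installing BeautifulSoup is not an option, the standard library's HTML parser can collect the same attributes. The following is a rough sketch rather than a drop-in replacement: the class name `HrefCollector` is hypothetical, and the import fallback is there only so the code runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>'''

collector = HrefCollector()
collector.feed(html_text)
collector.close()
print(collector.hrefs)
```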