Grabbing 2 items from 1 string?

Time: 2015-09-01 02:34:21

Tags: python regex

I'm not very good at this, as you can probably tell. I want to grab 2 things with 1 line.

For example:

<a href="(URL TO GRAB)">(TITLE TO GRAB)</a>
<a href="(URL TO GRAB)" rel="nofollow">(TITLE TO GRAB)</a>

The URL and the title always start with http or https:
<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>

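I tried using a replace to strip out rel="nofollow" (replacing it with nothing), but there are 130 other rel="variables" values. I only want to deal with the "nofollow" one, and I would rather not write a whole series of replaces.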

item_infos=<a href="([^"]+)"([^"]+)</a>
item_order=url.tmp|title.tmp
item_skill=rss

This is Python for Kodi / XBMC, scraping reddit.

Edit: Thanks for the help, everyone. I'm currently using the one Jon provided:

item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>

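It seems to work, but I won't know for sure until the thread updates later. Thanks again :)

2 answers:

Answer 0: (score: 1)

You can use an HTML parser such as BeautifulSoup. Here is an example: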

from bs4 import BeautifulSoup

html = '''<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>'''

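# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# soup.a is the first <a> tag in the document
print('href : {}'.format(soup.a['href']))
print('title : {}'.format(soup.a.text))

# find_all('a') walks every anchor, with or without extra attributes
for tag in soup.find_all('a'):
    print('href: {}, title: {}'.format(tag['href'], tag.text))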

Output

href : http(s)://www.whatever.com/1.html
title : http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/1.html, title: http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/2.html, title: http(s)://www.whatever.com/2.html
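
If adding a third-party dependency is not an option, roughly the same result can be had from the standard library. The following is only a sketch, assuming Python 3 (where the module is `html.parser`); the class name `AnchorCollector` and the sample URLs are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; other attributes
            # such as rel="nofollow" are simply ignored
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


# Hypothetical sample input modelled on the question
sample = '''<a href="http://www.whatever.com/1.html">http://www.whatever.com/1.html</a>
<a href="https://www.whatever.com/2.html" rel="nofollow">https://www.whatever.com/2.html</a>'''

collector = AnchorCollector()
collector.feed(sample)
collector.close()

for href, text in collector.links:
    print('href: {}, title: {}'.format(href, text))
```

Because the parser works on tags rather than raw text, any extra attributes on the anchor are skipped without needing a replace per attribute.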

Answer 1: (score: 1)

If you are just trying to match the anchor and extract the URL and the display text, something like this might work:

item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>
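
To see what the two capture groups pick up, the same pattern can be tried in plain Python with the `re` module. This is only a sketch: the sample URLs are invented, and how the Kodi add-on applies `item_infos` internally is not shown in the thread, so it is assumed here that it behaves like an ordinary regex search.

```python
import re

# Pattern from the answer: group 1 is the href value, group 2 is the link text.
# [^>]* skips any extra attributes (such as rel="nofollow") before the closing >.
ANCHOR_RE = re.compile(r'<a href="([^"]+)"[^>]*>([^<]+)</a>')

# Hypothetical sample input modelled on the question
html = '''<a href="http://www.whatever.com/1.html">http://www.whatever.com/1.html</a>
<a href="https://www.whatever.com/2.html" rel="nofollow">https://www.whatever.com/2.html</a>'''

# findall returns one (url, title) tuple per matched anchor
pairs = ANCHOR_RE.findall(html)

for url, title in pairs:
    print('url: {}, title: {}'.format(url, title))
```

Note that the pattern expects `href` to be the first attribute and the link text to contain no nested tags; anchors that differ from that shape will not match.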