Question

I'm not very good at this, as you can probably tell. I want to grab two things with one line.

For example:

<a href="(URL TO GRAB)">(TITLE TO GRAB)</a>
<a href="(URL TO GRAB)" rel="nofollow">(TITLE TO GRAB)</a>

The URL and the title always start with http or https:

<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>

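I tried using the replace method to strip rel="nofollow"> along with its leading space, but there are 130 other rel="variables"> and I only want to handle the "nofollow" one. I would rather not write several replacements.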

item_infos=<a href="([^"]+)"([^"]+)</a>
item_order=url.tmp|title.tmp
item_skill=rss

This is Python for Kodi/XBMC, scraping reddit.

Edit: Thanks for your help, everyone. I'm currently using the one Jon provided:

item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>

It seems to work, but I won't know for sure until the thread updates later. Thanks again :)

Answer 1

You can use an HTML parser such as BeautifulSoup. Here is an example:

from bs4 import BeautifulSoup

html = '''<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>'''

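# Name the parser explicitly so bs4 does not warn about guessing one.
soup = BeautifulSoup(html, 'html.parser')

# soup.a is the first <a> tag in the document.
print('href :', soup.a['href'])
print('title :', soup.a.text)

# Loop over every <a> tag; extra attributes such as rel="nofollow" are ignored.
for tag in soup.find_all('a'):
    print('href: {}, title: {}'.format(tag['href'], tag.text))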

Output:

href : http(s)://www.whatever.com/1.html
title : http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/1.html, title: http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/2.html, title: http(s)://www.whatever.com/2.html

Answer 2

If you are just trying to match the anchor and extract the URL and the display text, something like this might work:

item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>
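To see why this pattern helps, here is a minimal sketch that runs both the pattern from the question and the one above through Python's re module. It assumes the item_infos value is applied as an ordinary Python regular expression, and it swaps the "http(s)" placeholder for concrete http and https URLs; the variable names are only for illustration.

```python
import re

# Sample input modelled on the question; the concrete URLs are assumptions.
html = '''<a href="http://www.whatever.com/1.html">http://www.whatever.com/1.html</a>
<a href="https://www.whatever.com/2.html" rel="nofollow">https://www.whatever.com/2.html</a>'''

# Pattern from the question: [^"]+ cannot cross the quotes in rel="nofollow",
# and on plain anchors it drags the ">" into the title group.
original = re.compile(r'<a href="([^"]+)"([^"]+)</a>')

# Pattern from this answer: [^>]* skips any extra attributes up to the closing ">",
# then [^<]+ captures the link text.
fixed = re.compile(r'<a href="([^"]+)"[^>]*>([^<]+)</a>')

original_matches = original.findall(html)
fixed_matches = fixed.findall(html)

for url, title in fixed_matches:
    print('url: {}, title: {}'.format(url, title))
```

With the original pattern only the first anchor matches, and its title comes back with a leading ">". The second pattern returns a clean (url, title) pair for both anchors, whatever attributes follow the href.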
