I'm not very good at this, as you can probably tell. I want to grab two things with one line.
For example:
<a href="(URL TO GRAB)">(TITLE TO GRAB)</a>
<a href="(URL TO GRAB)" rel="nofollow">(TITLE TO GRAB)</a>
The URL and the title always start with http or https:
<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>
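I tried using a replace method to strip out the  rel="nofollow"> part (including the leading space), but there are 130 other rel="variables"> values. I only want to deal with the "nofollow" one, without having to write a bunch of replaces.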
item_infos=<a href="([^"]+)"([^"]+)</a>
item_order=url.tmp|title.tmp
item_skill=rss
This is Python for Kodi / XBMC, scraping reddit.
Edit: Thanks for the help, everyone. I'm currently using the one Jon provided:
item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>
It seems to work, but I won't know for sure until the thread updates later. Thanks again :)
Answer 0 (score: 1)
You can use an HTML parser such as BeautifulSoup. Here is an example:
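from bs4 import BeautifulSoup

html = '''<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>'''
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
# soup.a is the first <a> tag in the document.
print('href : {}'.format(soup.a['href']))
print('title : {}'.format(soup.a.text))
# Loop over every <a> tag.
for tag in soup.find_all('a'):
    print('href: {}, title: {}'.format(tag['href'], tag.text))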
Output:
href : http(s)://www.whatever.com/1.html
title : http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/1.html, title: http(s)://www.whatever.com/1.html
href: http(s)://www.whatever.com/2.html, title: http(s)://www.whatever.com/2.html
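If installing BeautifulSoup is not an option, roughly the same parser-based approach can be sketched with only the standard library. This is an illustration rather than part of the answer above: it assumes Python 3 (where the module is html.parser), and the LinkCollector class and its attributes are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; other attributes
            # such as rel="nofollow" are simply ignored.
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


html = '''<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>'''

parser = LinkCollector()
parser.feed(html)
parser.close()

for href, title in parser.links:
    print('href: {}, title: {}'.format(href, title))
```

Because the parser reads attributes by name, the rel attribute never needs special handling, whatever its value is.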
Answer 1 (score: 1)
If you are just trying to match the anchor and extract the URL and the display text, something like this might work:
item_infos=<a href="([^"]+)"[^>]*>([^<]+)</a>
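To see what that pattern captures, here is a minimal sketch that runs it through Python's re module against the two sample lines from the question. How the Kodi add-on applies item_infos internally is not shown in the thread, so this only checks the regex itself; the variable names are illustrative.

```python
import re

# Sample input taken from the question.
html = '''<a href="http(s)://www.whatever.com/1.html">http(s)://www.whatever.com/1.html</a>
<a href="http(s)://www.whatever.com/2.html" rel="nofollow">http(s)://www.whatever.com/2.html</a>'''

# Group 1: the href value.
# [^>]* skips any remaining attributes (for example rel="nofollow").
# Group 2: the link text, up to the closing </a>.
pattern = re.compile(r'<a href="([^"]+)"[^>]*>([^<]+)</a>')

# findall returns one (url, title) tuple per matching anchor.
matches = pattern.findall(html)

for url, title in matches:
    print('url: {}, title: {}'.format(url, title))
```

The [^>]* part is what makes the rel attribute irrelevant: it consumes anything between the closing quote of href and the end of the opening tag, so no per-value replaces are needed. Note that the pattern still expects href to be the first attribute and the link text to contain no nested tags.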