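I am trying to scrape the strings between <a>...</a> tags from an HTML file. I am using a regular expression with a pattern that recognizes links.
I tried to come up with a suitable regex pattern for this: ((/+\w+)+([:.]?(\w+))?(.org)?)(\W\w+)+
I have also written code to scrape the links I need.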
For these lines stored in a document:
<div class="portal" role="navigation" id='p-navigation'>
<h3>Navigation</h3>
<div class="body">
<ul>
<li id="n-mainpage-description"><a href="/wiki/Main_Page" title="Visit the main page [z]" accesskey="z">Main page</a></li>
<li id="n-contents"><a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">Contents</a></li>
<li id="n-featuredcontent"><a href="/wiki/Portal:Featured_content" title="Featured content the best of Wikipedia">Featured content</a></li>
<li id="n-currentevents"><a href="/wiki/Portal:Current_events" title="Find background information on current events">Current events</a></li>
<li id="n-randompage"><a href="/wiki/Special:Random" title="Load a random article [x]" accesskey="x">Random article</a></li>
<li id="n-sitesupport"><a href="//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en" title="Support us">Donate to Wikipedia</a></li>
</ul>
</div>
</div>
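Here is my code:
import re

def find_link(text):
    links = re.findall(r"((/+\w+)+([:.]?(\w+))?(.org)?)(\W\w+)+", text)
    return links

# "page.html" is a placeholder name for the file that stores the lines above
with open("page.html") as f:
    for link in find_link(f.read()):
        print(link)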
I expected the output to be one link per line, but instead I get:
('/wiki', '/wiki', '', '', '', '/Main_Page')
('/wiki/Portal', '/Portal', '', '', '', ':Contents')
('/wiki/Portal', '/Portal', '', '', '', ':Featured_content')
('/wiki/Portal', '/Portal', '', '', '', ':Current_events')
('/wiki/Special', '/Special', '', '', '', ':Random')
('//donate.wikimedia.org', '//donate', '.wikimedia', 'wikimedia', '.org', '=en')
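
For what it's worth, re.findall returns one tuple of groups per match whenever the pattern contains more than one capture group, which is why tuples are printed above. Below is a minimal sketch of one way around that, assuming the links always appear as double-quoted href attributes inside <a> tags. The names find_links and find_link_texts are hypothetical, and the two sample lines are copied from the HTML above.

```python
import re

# Two sample lines copied from the HTML in the question
html = '''<li id="n-contents"><a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">Contents</a></li>
<li id="n-randompage"><a href="/wiki/Special:Random" title="Load a random article [x]" accesskey="x">Random article</a></li>'''


def find_links(text):
    # Exactly one capture group, so findall returns plain strings, not tuples.
    # Assumption: href values are wrapped in double quotes.
    return re.findall(r'<a\s[^>]*?href="([^"]*)"', text)


def find_link_texts(text):
    # Captures the text between <a ...> and </a>.
    # Assumption: the link text contains no nested tags.
    return re.findall(r'<a\s[^>]*>(.*?)</a>', text)


for link in find_links(html):
    print(link)
```

If the markup is less regular than this sample, an HTML parser such as the standard library's html.parser is probably a safer choice than a regular expression.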