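I have a website, for example http://site.com
I want to fetch the home page and extract only the links that match a regular expression, e.g. .*somepage.*
The links in the HTML source can take any of these forms: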
<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
I need the output in this format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URLs must always include the domain name.
Is there a quick Python solution for this?
Answer 0: (score: 2)
You can use lxml.html:
from lxml import html
url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)
Or use beautifulsoup4:
import re
try:  # Python 2
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
# name the parser explicitly so bs4 does not pick one silently and warn
soup = BeautifulSoup(urlopen(url), "html.parser", parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Answer 1: (score: 1)
Use an HTML parsing module such as BeautifulSoup. Some code (just a fragment):
from bs4 import BeautifulSoup
import re
html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html, "html.parser")  # explicit parser avoids the bs4 warning
links = soup.find_all('a',{'href':re.compile('.*somepage.*')})
for link in links:
    print(link['href'])  # print() works on both Python 2 and 3
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
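The hrefs come out exactly as written in the HTML, so the relative ones still lack the domain. You should be able to get the format you want from this data...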
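To get from those raw hrefs to the format the question asks for, one option is to resolve each one against the page URL with the standard library's urljoin. This is only a minimal sketch: base_url is the example site from the question, the hrefs list is the output printed above, and absolutize is a hypothetical helper name, not part of any library.

```python
import re
from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

# Assumption: the page was fetched from this URL (the example site in the question).
base_url = "http://site.com"

# The raw href values printed above.
hrefs = [
    "http://site.com/my-somepage",
    "/my-somepage.html",
    "my-somepage.htm",
]


def absolutize(base, links, pattern):
    """Hypothetical helper: keep links matching pattern, resolved against base."""
    return [urljoin(base, link) for link in links if pattern.search(link)]


absolute_urls = absolutize(base_url, hrefs, re.compile(r"somepage"))
print("\n".join(absolute_urls))
```

urljoin leaves an already absolute link untouched and resolves the other two against the base, so all three end up with the domain in front.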
Answer 2: (score: 1)