I want to use pattern matching in Python to extract some of the URLs from a list of links. Sample URLs:
http://www.fairobserver.com/about/
http://www.fairobserver.com/about/interview/
Here is my regular expression:
re.match(r'(http?|ftp)(://[a-zA-Z0-9+&/@#%?=~_|!:,.;]*)(.\b[a-z]{1,3}\b)(/about[a-zA-Z-_]*/?)', str(href), re.IGNORECASE)
I only want the links that end with /about or /about/, but the regular expression above selects every link that contains the word "about".
Answer 0 (score: 0)
I suggest parsing your URLs with a proper library instead, e.g. urlparse.
For example:
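# Python 2 module; in Python 3 the same function lives in urllib.parse
import urlparse

samples = [
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
]

def about_filter(urls):
    # Yield only the URLs whose path ends with /about/
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.endswith('/about/'):
            yield url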
Output:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/']
Or, to keep every URL whose path starts with /about:
def about_filter(urls):
    for url in urls:
        parsed = urlparse.urlparse(url)
        if parsed.path.startswith('/about'):
            yield url
Output:
>>> print list(about_filter(samples))
['http://www.fairobserver.com/about/', 'http://www.fairobserver.com/about/interview/']
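The question asks for links ending in either /about or /about/, and the first filter above only accepts the form with a trailing slash. A minimal Python 3 sketch that accepts both, assuming `urllib.parse` is available and reusing the `about_filter` name from above (the extra sample URL without a trailing slash is made up for illustration):

```python
from urllib.parse import urlparse


def about_filter(urls):
    """Yield URLs whose path ends with /about or /about/."""
    for url in urls:
        path = urlparse(url).path
        # str.endswith accepts a tuple of suffixes
        if path.endswith(("/about", "/about/")):
            yield url


samples = [
    "http://www.fairobserver.com/about",  # hypothetical variant without the slash
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/interview/",
]

print(list(about_filter(samples)))
```

Note that this still accepts a path such as /interview/about/, since it only checks the end of the path.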
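Answer 1 (score: 0)
If you just want the links that end with /about or /about/, use an HTML parser and str.endswith:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.fairobserver.com/about/")
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")
hrefs = (a["href"] for a in soup.find_all("a", href=True))
print([h for h in hrefs if h.endswith(("/about", "/about/"))])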
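You can also use a regular expression with BeautifulSoup:
import re

r = requests.get("http://www.fairobserver.com/about/")
soup = BeautifulSoup(r.content, "html.parser")
# Keep only <a> tags whose href ends with /about or /about/
print([a["href"] for a in soup.find_all(
    "a", href=re.compile(".*/about/$|.*/about$"))])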
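Answer 2 (score: 0)
Based on your comment clarifying that the path should match /about/ or /about exactly, here is a version using urlparse that works on both Python 2 and Python 3.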
try:
    # https://docs.python.org/3.5/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse
    # python 3
    from urllib.parse import urlparse
except ImportError:
    # https://docs.python.org/2/library/urlparse.html#urlparse.urlparse
    # python 2
    from urlparse import urlparse

urls = (
    'http://www.fairobserver.com/about/',
    'http://www.fairobserver.com/about/interview/',
    'http://www.fairobserver.com/interview/about/',
)

for url in urls:
    print("{}: path is /about? {}".format(url,
          urlparse(url.rstrip('/')).path == '/about'))
Here is the output:
http://www.fairobserver.com/about/: path is /about? True
http://www.fairobserver.com/about/interview/: path is /about? False
http://www.fairobserver.com/interview/about/: path is /about? False
The important part is urlparse(url.rstrip('/')).path == '/about', which normalizes the URL by stripping the trailing / before parsing, so we don't have to use a regular expression.
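One caveat, in case the links can carry a query string or fragment: stripping the trailing / from the whole URL does nothing when the URL ends in something like ?ref=home. A possible variant is to parse first and strip the slash from the path only. This is a sketch for Python 3; the `is_about` helper name and the ?ref=home URL are assumptions made up for illustration, not part of the original question.

```python
from urllib.parse import urlparse


def is_about(url):
    """Return True if the URL's path is exactly /about or /about/."""
    # Parse first, then normalize only the path component
    return urlparse(url).path.rstrip("/") == "/about"


urls = (
    "http://www.fairobserver.com/about/",
    "http://www.fairobserver.com/about/?ref=home",  # hypothetical query string
    "http://www.fairobserver.com/about/interview/",
    "http://www.fairobserver.com/interview/about/",
)

for url in urls:
    print("{}: path is /about? {}".format(url, is_about(url)))
```

With the strip-before-parse comparison, the second URL would be rejected because its parsed path is still /about/.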