Question

I'm using Python 3.5.1 and BeautifulSoup to scrape a website, and I want to use a regular expression to find specific links. My code:

from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/expexhibitorlist.aspx?categoryno=404').read()
soup = BeautifulSoup(r,"html.parser")
links = soup.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
linksfromcategories = [link["href"] for link in links]
print(linksfromcategories)

This returns every link that looks similar:

['/cn/ExpExhibitorList.aspx?categoryno=432', 'ExpExhibitorList.aspx?categoryno=432003']

But I don't want this one to be matched:

'/cn/ExpExhibitorList.aspx?categoryno=432'

Answer 1

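Just anchor the regular expression. BeautifulSoup tests a compiled pattern against each href with its search() method, so without anchors the pattern can match anywhere inside the value, including after the '/cn/' prefix.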

links = soup.find_all("a", href=re.compile(r"^ExpExhibitorList\.aspx\?categoryno=[0-9]+$"))

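This matches only the a tags whose href matches the pattern from start to end, so '/cn/ExpExhibitorList.aspx?categoryno=432' is excluded because it begins with '/cn/'.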
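
To see the effect of the anchors without hitting the site, here is a minimal sketch that applies both patterns to the two href values from the question using only the re module. The variable names (hrefs, unanchored, anchored and the two result lists) are made up for illustration and are not part of the original code.

```python
import re

# The two href values shown in the question's output.
hrefs = [
    "/cn/ExpExhibitorList.aspx?categoryno=432",
    "ExpExhibitorList.aspx?categoryno=432003",
]

# Pattern from the question (no anchors) and from the answer (with ^ and $).
unanchored = re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+")
anchored = re.compile(r"^ExpExhibitorList\.aspx\?categoryno=[0-9]+$")

# search() succeeds if the pattern occurs anywhere in the string,
# so the '/cn/' link is matched as well.
unanchored_hits = [h for h in hrefs if unanchored.search(h)]

# With ^ and $ the pattern has to cover the whole href,
# which rules out anything with a leading '/cn/'.
anchored_hits = [h for h in hrefs if anchored.search(h)]

print(unanchored_hits)
print(anchored_hits)
```

The first list contains both links and the second contains only the relative one. If you are matching plain strings yourself rather than passing the pattern to find_all, calling fullmatch() on the unanchored pattern should give the same filtering.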
