How do I find URLs with a regular expression in Python?

Asked: 2016-02-24 15:46:01

Tags: python regex beautifulsoup

How can I find all URLs that match a regular expression pattern?

I am trying to find all URLs that match a pattern using a regular expression, but the call fails with TypeError: expected string or bytes-like object

I am using Python 3.5.1.

from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/expexhibitorlist.aspx?categoryno=404').read()
soup = BeautifulSoup(r,"html.parser")

regex = '(<a href="expexhibitorlist.aspx\?categoryno=)?[0-9]+?>'
pattern = re.compile(regex)
mycateurl = re.findall(pattern,soup)
print (mycateurl)

1 answer:

Answer 0 (score: 0)

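re.findall() expects a string, but soup is a BeautifulSoup object, hence the TypeError. Use soup.find_all() instead and pass a compiled regular expression as the href argument value.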
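The TypeError from the question can be reproduced without network access or third-party packages. The sketch below is only an illustration: FakeSoup is a hypothetical stand-in for a parsed document object (it is not part of BeautifulSoup), and the HTML snippet is made up. It shows that re refuses a non-string object and that converting the object to a string first lets the same pattern run.

```python
import re


class FakeSoup:
    """Hypothetical stand-in for a parsed document; not a BeautifulSoup class."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


# Made-up HTML snippet for illustration only.
html = (
    '<a href="expexhibitorlist.aspx?categoryno=404">Category</a> '
    '<a href="other.aspx">Other</a>'
)
pattern = re.compile(r'href="(expexhibitorlist\.aspx\?categoryno=[0-9]+)"')

doc = FakeSoup(html)

# Passing the object itself fails: re only accepts str or bytes-like input.
error_message = None
try:
    pattern.findall(doc)
except TypeError as exc:
    error_message = str(exc)

# Converting to a string first lets the pattern run.
matches = pattern.findall(str(doc))

print(error_message)
print(matches)
```

Matching raw HTML with a regular expression is fragile (attribute order, quoting, and whitespace can vary), which is one reason to filter parsed tags with soup.find_all() rather than searching the markup text.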

Working example:

import re
import urllib.request

from bs4 import BeautifulSoup


r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")

links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
print([link["href"] for link in links])

Prints:

['expexhibitorlist.aspx?categoryno=411', 'expexhibitorlist.aspx?categoryno=412', 'expexhibitorlist.aspx?categoryno=453', 'expexhibitorlist.aspx?categoryno=410', 'expexhibitorlist.aspx?categoryno=414', 'expexhibitorlist.aspx?categoryno=415', 'expexhibitorlist.aspx?categoryno=403', 'expexhibitorlist.aspx?categoryno=404', 'expexhibitorlist.aspx?categoryno=405', 'expexhibitorlist.aspx?categoryno=408', 'expexhibitorlist.aspx?categoryno=402', 'expexhibitorlist.aspx?categoryno=401', 'expexhibitorlist.aspx?categoryno=454', 'expexhibitorlist.aspx?categoryno=455', 'expexhibitorlist.aspx?categoryno=451', 'expexhibitorlist.aspx?categoryno=406', 'expexhibitorlist.aspx?categoryno=407', 'expexhibitorlist.aspx?categoryno=416', 'expexhibitorlist.aspx?categoryno=427', 'expexhibitorlist.aspx?categoryno=434', 'expexhibitorlist.aspx?categoryno=435', 'expexhibitorlist.aspx?categoryno=436', 'expexhibitorlist.aspx?categoryno=437', 'expexhibitorlist.aspx?categoryno=438', 'expexhibitorlist.aspx?categoryno=439', 'expexhibitorlist.aspx?categoryno=440', 'expexhibitorlist.aspx?categoryno=441', 'expexhibitorlist.aspx?categoryno=442', 'expexhibitorlist.aspx?categoryno=443', 'expexhibitorlist.aspx?categoryno=444', 'expexhibitorlist.aspx?categoryno=445', 'expexhibitorlist.aspx?categoryno=446', 'expexhibitorlist.aspx?categoryno=447', 'expexhibitorlist.aspx?categoryno=448', 'expexhibitorlist.aspx?categoryno=449', 'expexhibitorlist.aspx?categoryno=452', 'expexhibitorlist.aspx?categoryno=417', 'expexhibitorlist.aspx?categoryno=418', 'expexhibitorlist.aspx?categoryno=419', 'expexhibitorlist.aspx?categoryno=420', 'expexhibitorlist.aspx?categoryno=421', 'expexhibitorlist.aspx?categoryno=422', 'expexhibitorlist.aspx?categoryno=423', 'expexhibitorlist.aspx?categoryno=424', 'expexhibitorlist.aspx?categoryno=425', 'expexhibitorlist.aspx?categoryno=426', 'expexhibitorlist.aspx?categoryno=428', 'expexhibitorlist.aspx?categoryno=430', 'expexhibitorlist.aspx?categoryno=431', 
'expexhibitorlist.aspx?categoryno=432', 'expexhibitorlist.aspx?categoryno=433']
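The href values printed above are relative to the page they came from. If absolute URLs or the category numbers are needed, one possible follow-up step, sketched here with only the standard library, is to join each href against the page URL with urllib.parse.urljoin and pull the number out with a capturing group. The three hrefs are copied from the output above; the variable names are illustrative assumptions, not part of the original answer.

```python
import re
from urllib.parse import urljoin

# Page URL taken from the working example.
base_url = "http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware"

# A few of the relative hrefs from the printed output.
hrefs = [
    "expexhibitorlist.aspx?categoryno=411",
    "expexhibitorlist.aspx?categoryno=412",
    "expexhibitorlist.aspx?categoryno=453",
]

# Resolve each relative href against the page it was found on.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

# Extract the numeric category id with a capturing group.
category_pattern = re.compile(r"categoryno=([0-9]+)")
category_numbers = [
    int(category_pattern.search(href).group(1)) for href in hrefs
]

print(absolute_urls)
print(category_numbers)
```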