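I want to scrape the URLs held in a list. Basically I am scraping a website: I scrape one page for a specific set of links, open each of those links, and then search every resulting page for another specific set of links to scrape. My code: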
from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for link in links])
string = "http://i.cantonfair.org.cn/en/"
linksfromcategories = [string + x for x in linksfromcategories]
subcatlinks = list()
for link in linksfromcategories:
    response = urllib.request.urlopen(link)
    soup2 = BeautifulSoup(response, "html.parser")
    links2 = soup2.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
    linksfromsubcategories = ([link["href"] for link in links2])
    subcatlinks.append(linksfromsubcategories)
responses = urllib.request.urlopen(subcatlinks)
soup3 = BeautifulSoup(responses, "html.parser")
print (soup3)
I get this error:
Traceback (most recent call last):
  File "D:\python\phase2.py", line 46, in <module>
    responses = urllib.request.urlopen(subcatlinks)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Answer 0 (score: 1)
You can only pass one link at a time to urllib.request.urlopen (a URL string or a Request object), not a whole list.
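To see where the message comes from: urlopen wraps a string in a Request, but it treats any other argument as if it already were a Request and tries to set a timeout attribute on it. Here is a minimal sketch that reproduces the error. It should fail before any request is sent, and the example.com URLs are placeholders, not taken from the question.

```python
import urllib.request

message = None
try:
    # Passing a list instead of a single URL string (placeholder URLs)
    urllib.request.urlopen(["http://example.com/a", "http://example.com/b"])
except AttributeError as exc:
    # urlopen tried to do `req.timeout = timeout` on the list itself
    message = str(exc)

print(message)
```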
So you need another loop, like this:
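# subcatlinks was filled with append(), so each element is itself a list of hrefs
for sublist in subcatlinks:
    for link in sublist:
        # urlopen needs a full URL; prefix relative hrefs with the base URL first
        response = urllib.request.urlopen(link)
        soup3 = BeautifulSoup(response, "html.parser")
        print(soup3)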
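Two details in the question's code are easy to miss. Each call to subcatlinks.append(linksfromsubcategories) adds a whole list, so subcatlinks ends up as a list of lists. The hrefs collected from the category pages may also be relative, as the first-level ones apparently are. Below is a rough sketch of one way to flatten and resolve them before looping. It assumes the same base prefix as the question's string variable; flatten_links and the sample category numbers are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://i.cantonfair.org.cn/en/"  # same prefix as `string` in the question


def flatten_links(base_url, nested_hrefs):
    """Turn a list of href lists into one flat list of absolute URLs."""
    flat = []
    for hrefs in nested_hrefs:
        for href in hrefs:
            # urljoin resolves relative hrefs and leaves absolute URLs unchanged
            flat.append(urljoin(base_url, href))
    return flat


# Hypothetical data shaped like subcatlinks after the question's append() calls
sample = [
    ["ExpExhibitorList.aspx?categoryno=1", "ExpExhibitorList.aspx?categoryno=2"],
    ["http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?categoryno=3"],
]
urls = flatten_links(BASE_URL, sample)
print(urls)
```

Every element of urls is then a single string, so it can be handed to urllib.request.urlopen one at a time inside a for loop. Calling subcatlinks.extend(...) instead of append(...) when building the list would avoid the nesting in the first place.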