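I am scraping a website to get company and product details. The page has div tags that contain li tags, and I want to get all the li tags inside a div tag. I am using Python 3.5.1 and BeautifulSoup.
My code: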
    from bs4 import BeautifulSoup
    import urllib.request
    import re

    r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
    soup = BeautifulSoup(r, "html.parser")
    # Top-level category links
    links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
    linksfromcategories = [link["href"] for link in links]
    string = "http://i.cantonfair.org.cn/en/"
    linksfromcategories = [string + x for x in linksfromcategories]

    for link in linksfromcategories:
        response = urllib.request.urlopen(link)
        soup2 = BeautifulSoup(response, "html.parser")
        # Subcategory links (stray backslash before "E" removed from the pattern)
        links2 = soup2.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
        linksfromsubcategories = [link["href"] for link in links2]
        linksfromsubcategories = [string + x for x in linksfromsubcategories]
        for link in linksfromsubcategories:
            response = urllib.request.urlopen(link)
            soup3 = BeautifulSoup(response, "html.parser")
            links3 = soup3.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
            linksfromsubcategories2 = [link["href"] for link in links3]
            linksfromsubcategories2 = [string + x for x in linksfromsubcategories2]
            for link in linksfromsubcategories2:
                response2 = urllib.request.urlopen(link)
                soup4 = BeautifulSoup(response2, "html.parser")
                # Company page links
                companylink = soup4.find_all("a", href=re.compile(r"expCompany\.aspx\?corpid=[0-9]+"))
                companylink = [link["href"] for link in companylink]
                companylink = [string + x for x in companylink]
                for link in companylink:
                    response3 = urllib.request.urlopen(link)
                    soup5 = BeautifulSoup(response3, "html.parser")
                    companydetail = soup5.find_all("div", id="contact")
                    for element in companydetail:
                        companyname = element.a[0].get_text()
                        print(companyname)
                        companyaddress = element.a[1].get_text()
                        print(companyaddress)
And I am getting this error:
    Traceback (most recent call last):
      File "D:\python\phase3.py", line 54, in <module>
        lis = companydetail.find_all('li')
    AttributeError: 'ResultSet' object has no attribute 'find_all'
Answer 0 (score: 1)
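companydetail is a ResultSet, that is, an iterable object containing many elements (like a list or set). The error occurs because you are calling .find_all() on the ResultSet itself, and a ResultSet has no such method. Instead, iterate over it and call find_all() on each element in the ResultSet, like this: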
    for d in companydetail:
        lis = d.find_all('li')
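To see the difference without hitting the live site, here is a minimal sketch on a small inline document. The HTML snippet and its contents are made up for illustration and are only an assumption about what the real company page might look like.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real company page.
HTML = """
<div id="contact">
  <ul>
    <li>Example Glassware Co.</li>
    <li>1 Example Road</li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
companydetail = soup.find_all("div", id="contact")  # a ResultSet of Tag objects

# Calling find_all() on the ResultSet itself raises AttributeError.
try:
    companydetail.find_all("li")
    raised = False
except AttributeError:
    raised = True

# Calling find_all() on each Tag inside the ResultSet works.
items = []
for d in companydetail:
    for li in d.find_all("li"):
        items.append(li.get_text(strip=True))

print(raised)
print(items)
```

Note that the loop collects the li elements from every matching div, whereas assigning to lis inside the loop keeps only the result for the last div.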
Or use a list comprehension to get a list of all the li elements in companydetail:
    lis = [li for d in companydetail for li in d.find_all('li')]
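In a comprehension with two for clauses, the clauses are read left to right in the same order as the equivalent nested loops, so the loop over companydetail has to come before the loop over d.find_all('li'). A small sketch with plain lists (the data is made up; each inner list stands in for the result of d.find_all('li')):

```python
# Hypothetical stand-in data: each inner list plays the role of d.find_all('li').
groups = [["a", "b"], ["c"], []]

# Equivalent nested loops: outer loop first, inner loop second.
flat_loop = []
for group in groups:
    for item in group:
        flat_loop.append(item)

# The comprehension keeps the same clause order as the loops above.
flat_comp = [item for group in groups for item in group]

print(flat_comp)
```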