I am trying to parse out the div class named "dealer-info" from the URL below.
https://www.nissanusa.com/dealer-locator.html
I have tried:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Ordinarily, I would expect this to work, but instead I get: HTTPError: Forbidden
I also tried this:
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the raw response body
print(data)
This does give me all of the HTML on the site, but it comes back as one raw blob with no structure I can make sense of.
I am trying to get a structured dataset of the "dealer-info" contents. I am using Python 3.6.
Answer 0 (score: 0)
In the first example, you are most likely being rejected by the server because you are not pretending to be an ordinary web browser. You should try combining the user-agent code from your second example with the Beautiful Soup code from your first:
import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
text = response.read()
soup = BeautifulSoup(text, "lxml")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
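If that works, you can go one step further than printing hrefs and collect the matches into the structured dataset you asked for. This is only a sketch, assuming (I have not verified the page's markup) that each dealer-info div holds the dealer's details as text plus one or more links:

dealers = []
for div in soup.findAll('div', attrs={'class': 'dealer-info'}):
    dealers.append({
        'text': div.get_text(' ', strip=True),  # all visible text inside the div
        'links': [a.get('href') for a in div.findAll('a') if a.get('href')],
    })
print(dealers)  # a list of dicts, one per dealer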
Keep in mind that if the website is explicitly trying to keep Beautiful Soup or other unrecognized user agents out, they may take issue with you scraping their data. You should consult https://www.nissanusa.com/robots.txt, as well as any terms of use or service agreements you may have agreed to, and abide by them.
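You can even check robots.txt programmatically with the standard library's urllib.robotparser; a minimal sketch (note that the robots.txt fetch itself is made with urllib's default user agent, so a strict server may refuse that request as well):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.nissanusa.com/robots.txt")
rp.read()  # download and parse robots.txt
# user_agent is the same string used in the snippet above
allowed = rp.can_fetch(user_agent, "https://www.nissanusa.com/dealer-locator.html")
print("Allowed by robots.txt" if allowed else "Disallowed by robots.txt")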