This is my first time writing a web scraper in Python. I've worked through a few tutorials and am now attempting my first scraper of my own. It's a very simple test, and it produces the error noted in the subject line.
import requests
from bs4 import BeautifulSoup
url = "https://www.autotrader.ca/cars/mercedes-benz/ab/calgary/?rcp=15&rcs=0&srt=3&prx=100&prv=Alberta&loc=T3P%200H2&hprc=True&wcp=True&sts=Used&adtype=Private&showcpo=1&inMarket=advancedSearch"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
html = requests.get(url,headers={'User-Agent': user_agent})
soup = BeautifulSoup(html, "lxml")
print(soup)
Please help me debug this code. Any help would be greatly appreciated!
Answer 0 (score: 1)
Use html.text instead of html. Sending a header with the user agent in the get() method, as you are doing, is good practice.
import requests
from bs4 import BeautifulSoup
url = "https://www.autotrader.ca/cars/mercedes-benz/ab/calgary/?rcp=15&rcs=0&srt=3&prx=100&prv=Alberta&loc=T3P%200H2&hprc=True&wcp=True&sts=Used&adtype=Private&showcpo=1&inMarket=advancedSearch"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,"lxml")
print(soup)  # "return" is only valid inside a function; print (or use) the soup at module level
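As a quick sanity check that the fix works (a standalone sketch, not part of the original answer: it parses a small hypothetical HTML snippet standing in for response.text, so it runs without hitting the live AutoTrader URL):

```python
from bs4 import BeautifulSoup

# A small HTML string standing in for response.text (which is a str);
# the real code would pass response.text here instead.
sample_html = "<html><body><h1>Listings</h1><p class='price'>$30,000</p></body></html>"

# "html.parser" is the stdlib parser; "lxml" also works if installed.
soup = BeautifulSoup(sample_html, "html.parser")
title = soup.find("h1").get_text()
price = soup.find("p", class_="price").get_text()
print(title, price)  # Listings $30,000
```

The key point is that BeautifulSoup expects markup (a str or bytes), which is exactly what response.text provides.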
Answer 1 (score: 0)
Change this line:
soup = BeautifulSoup(html, "lxml")
to
soup = BeautifulSoup(html.content, "lxml")
or
soup = BeautifulSoup(html.text, "lxml")
Either of these passes the HTML of the web page to the parser, rather than the Response object itself.
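To see why both forms work, note that response.text is a decoded str and response.content is raw bytes, and BeautifulSoup accepts either; passing the Response object itself is what triggers the original error. A minimal sketch using a made-up markup sample rather than a live request:

```python
from bs4 import BeautifulSoup

# Stand-ins for a Response's two payload attributes:
as_text = "<div><span>ok</span></div>"   # like response.text (str)
as_bytes = as_text.encode("utf-8")       # like response.content (bytes)

# BeautifulSoup parses both the str and the bytes form.
soup_from_text = BeautifulSoup(as_text, "html.parser")
soup_from_bytes = BeautifulSoup(as_bytes, "html.parser")
print(soup_from_text.span.get_text())   # ok
print(soup_from_bytes.span.get_text())  # ok
```

.content is the safer choice when the page's encoding is unusual, since BeautifulSoup will detect the encoding from the bytes itself.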