Here is my error log:
>>python crawler.py
Traceback (most recent call last):
  File "crawler.py", line 163, in <module>
    crawler.run()
  File "crawler.py", line 90, in run
    for index, url in enumerate(self.parse_menu(self.request(self.start_url))):
  File "crawler.py", line 116, in parse_menu
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
**IndexError: list index out of range**
Here is part of my code:
def parse_menu(self, response):
    soup = BeautifulSoup(response.content, "html.parser")
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    for li in menu_tag.find_all("li"):
        url = li.a.get("href")
        if not url.startswith("http"):
            url = "".join([self.domain, url])
        yield url
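Answer 0 (score: 0)
Below is a piece of code I found. I think the website you are trying to access blocks crawlers, so you have to disguise your script by sending browser headers with the request.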
from bs4 import BeautifulSoup
import requests

url = "https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
# Send a browser-like User-Agent so the request looks like it comes from a browser.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Take the second element whose class attribute is "uk-nav uk-nav-side".
menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
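The IndexError itself only says that find_all returned fewer than two matches, which is what you would see if the server sent your script a different page than it sends a browser. As a rough sketch (not the original crawler's code), the lookup could check the number of matches before indexing, so a blocked or changed page produces a readable error. The names extract_menu_urls and MENU_CLASS are made up for illustration, and urljoin is swapped in for the string concatenation used in the question.

```python
from urllib.parse import urljoin

# Class string taken from the question; the helper names are hypothetical.
MENU_CLASS = "uk-nav uk-nav-side"


def extract_menu_urls(soup, domain, index=1):
    """Return absolute URLs from the menu at position `index`.

    `soup` is expected to be a BeautifulSoup object (anything with a
    compatible find_all method works). Raises ValueError instead of
    IndexError when the page holds fewer menus than expected.
    """
    menus = soup.find_all(class_=MENU_CLASS)
    if len(menus) <= index:
        raise ValueError(
            "expected at least %d elements with class %r, found %d; "
            "the server may have returned a different page"
            % (index + 1, MENU_CLASS, len(menus))
        )

    urls = []
    for li in menus[index].find_all("li"):
        # Skip list items that have no link or no href attribute.
        if li.a is None:
            continue
        href = li.a.get("href")
        if href:
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            urls.append(urljoin(domain, href))
    return urls
```

With the soup object built above, this could be called as extract_menu_urls(soup, "https://www.liaoxuefeng.com"). If it raises ValueError, printing page.status_code and the start of page.text should show what the server actually returned.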