I want to scrape a data site, but there is a problem with my code.
I want to find out why looking up the object fails. I searched Stack Overflow, but I could not find what is wrong with this code.
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests
from matplotlib import font_manager, rc
client = MongoClient("localhost", 27017)
database = client.datadb
collection = database.datacol
page = requests.get("https://www.worlddata.info/average-income.php")
soup = BeautifulSoup(page.content, 'html.parser')
general_list = soup.find("tr")
#list_of_tr = general_list.find("tr")
for in_each_tr in general_list:
    list_of_td0 = general_list.find_all("td")[0]
    list_of_td1 = general_list.find_all("td")[1]
    general_list = collection.insert_one({"country":list_of_td0.get_text(), "income":list_of_td1.get_text()})
Traceback (most recent call last):
File "C:/Users/SAMSUNG/PycharmProjects/simple/data.py", line 18, in <module>
for in_each_tr in general_list:
TypeError: 'NoneType' object is not iterable
Answer 0: (score: 0)
Your general_list has a value of None.
You need to add a validation check before performing an operation on the object.
I assume this address returned a Forbidden error, so the response contains no <tr>.
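A minimal sketch of such a check, assuming you simply want to stop with a clear message when no <tr> comes back, could be:
general_list = soup.find("tr")
if general_list is None:
    # No table row came back - the request probably failed (e.g. 403 Forbidden).
    raise SystemExit("No <tr> found; HTTP status was %s" % page.status_code)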
If you change the address to:
page = requests.get("https://www.google.com")
soup = BeautifulSoup(page.content, 'html.parser')
general_list = soup.find("tr")
for tr in general_list:
    print(tr)
it works.
Answer 1: (score: 0)
"https://www.worlddata.info/average-income.php" loads its data via an AJAX request, so you need to use Selenium to download the dynamic content.
First, install the Selenium web driver that matches your browser.
Import the Selenium webdriver:
from selenium import webdriver
Download the webpage content:
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.worlddata.info/average-income.php')
"/usr/bin/chromedriver"
Webdriver路径在哪里
Get the HTML content:
soup = BeautifulSoup(driver.page_source, 'lxml')
Now you will get the tr tag object:
general_list = soup.find("tr")
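Putting those steps together, a minimal end-to-end sketch (assuming chromedriver is at /usr/bin/chromedriver and that printing the first row is enough to verify it works) might look like this:
from bs4 import BeautifulSoup
from selenium import webdriver

# Start Chrome through the chromedriver binary (adjust the path for your system).
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.worlddata.info/average-income.php')

# Parse the fully rendered page source, then close the browser.
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

general_list = soup.find("tr")
print(general_list)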
Answer 2: (score: 0)
It seems requests.get("https://www.worlddata.info/average-income.php") returns a 403 response, which means access to the webpage is forbidden.
I did a quick Google search and found this StackOverflow post. It says that some webpages reject GET requests that do not identify a User-Agent.
If you add a header to requests.get like this:
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get("https://www.worlddata.info/average-income.php", headers=header)
then the response to the GET request will be 200, and your code should work as expected.
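For reference, a rough sketch combining that header with a loop over all rows (my assumption being that you want every data row, each of which has at least two <td> cells, stored via the MongoDB collection from the question) could look like this:
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests

client = MongoClient("localhost", 27017)
collection = client.datadb.datacol

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get("https://www.worlddata.info/average-income.php", headers=header)
soup = BeautifulSoup(page.content, 'html.parser')

# Iterate over every table row, not just the first one.
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # skip header or otherwise incomplete rows
    collection.insert_one({"country": cells[0].get_text(strip=True),
                           "income": cells[1].get_text(strip=True)})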
Answer 3: (score: 0)
I have some more problems.
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests
from selenium import webdriver
from matplotlib import font_manager, rc
client = MongoClient("localhost", 27017)
database = client.datadb
collection = database.datacol
driver = webdriver.Chrome("C:\chromedriver")
driver.get('https://www.worlddata.info/average-income.php')
page = requests.get("https://www.worlddata.info/average-income.php")
soup = BeautifulSoup(driver.page_source, 'lxml')
#soup = BeautifulSoup(page.content, 'html.parser')
general_list = soup.find("tr")
for in_each_tr in general_list:
    list_of_td0 = general_list.find_all("a")
    list_of_td1 = general_list.find_all(class_="right nowrap")[0]
    list_all = collection.insert_one({"country:" + list_of_td0.get_text() + ", income:" + list_of_td1.get_text()})
I get this error:
selenium.common.exceptions.WebDriverException: Message: chrome not reachable (Session info: chrome=74.0.3729.169) (Driver info: chromedriver=74.0.3729.6 (255758eccf3d244491b8a1317aa76e1ce10d57e9-refs/branch-heads/3729@{#29}), platform=Windows NT 10.0.17763 x86_64)