Question

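I want to use Python web scraping to feed an ML application I wrote that produces a summary of summaries, to simplify my daily research work. I seem to be running into difficulties even though I have followed a lot of suggestions found online, such as this one: Python Selenium accessing HTML source. I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' / 'content', depending on the attempt and the module used. I need this source to feed Beautiful Soup, so that I can scrape it and pass the summaries to my ML script. My first attempt was with requests: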

from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())

page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"

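This is my target page. I usually make about 20 requests a day, so I am not trying to drain the site. Since I need them all at the same time, I want to automate the retrieval, because the longest part is getting the URL, loading the page, and copying and pasting the summaries. I am also being reasonable, since I respect some delay before loading another page. I tried to pass as a regular browser because the site does not like robots (it disallows /ProductRedirect and something with a number that I could not find on Google?)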

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page,  headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")

I always get no content, even though the request returns status code 200 and I can load the page myself in Firefox. So I tried Selenium:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())

browser = webdriver.Firefox()
current_page =browser.get(page)
time.sleep(10)

This works and loads the page. I added the delay to make sure I do not spam the host and that the page is fully loaded. Then neither:

html=current_page.content

nor

html=current_page.page_source

nor

html=current_page

works as input to:

soup=BS(html,"lxml")

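It always says there is no page_source attribute (even though there should be one, since the page loads correctly in the browser window that Selenium opens).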

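I don't know what to try next. Just as the User-Agent header does not seem to work for requests, it is strange that Selenium returns a page with no source.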

What could I try next? Thanks.

Note that I also tried:

browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)

This does get the source, but it then fails at the BS stage with: "AttributeError: 'NoneType' object has no attribute 'find'"

Answer 1

The problem is that you are trying to iterate over the result of .find(). Instead, you need .find_all():

for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)

Or, if there is only a single element, don't use a loop:

summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
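
To make the difference concrete, here is a minimal sketch of how the two calls behave when nothing matches. The inline HTML and the helper name section_text are made up for illustration (they are not taken from the GeneCards page), and "html.parser" is used instead of "lxml" only so the snippet has no extra dependency. .find() returns a single Tag or None, while .find_all() returns a list-like result that is simply empty when there is no match, so looping over it is safe. Looping over a Tag returned by .find() walks its children rather than the matches, and looping over None raises a TypeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page source.
html = """
<html><body>
  <section id="_summaries"><p>Example summary text.</p></section>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() gives back one Tag, or None when nothing matches.
missing = soup.find("section", attrs={"id": "does-not-exist"})

# find_all() always gives back an iterable, possibly empty.
no_matches = soup.find_all("section", attrs={"id": "does-not-exist"})


def section_text(soup, section_id):
    """Return the text of the section with the given id, or None if absent."""
    node = soup.find("section", attrs={"id": section_id})
    if node is None:
        # Guard here instead of letting a later attribute access fail.
        return None
    return node.get_text(" ", strip=True)


print(section_text(soup, "_summaries"))
print(section_text(soup, "does-not-exist"))
```

If the real page returns None for '_summaries', the id is probably not present in the HTML that was actually fetched, so printing part of the page source and searching it for that id would be a reasonable next check.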

Answer 2

You don't have to convert the HTML to a string object.

Try:

html = browser.page_source
soup = BS(html,"lxml")
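
For context on why current_page.page_source fails in the question: WebDriver.get() navigates the browser and returns None, so current_page = browser.get(page) stores None, and the page source has to be read from the driver object itself. A small sketch of that pattern follows. The helper name fetch_html and the delay_seconds parameter are my own, not part of Selenium, and the function only assumes the object it receives has a get() method and a page_source attribute, as a Selenium WebDriver does.

```python
import time


def fetch_html(driver, url, delay_seconds=0.0):
    """Load url in the given driver and return the rendered HTML as a string.

    driver is expected to be a Selenium WebDriver (for example the result
    of webdriver.Firefox()); fetch_html itself is a hypothetical helper.
    """
    # get() only navigates; its return value is None, so do not keep it.
    driver.get(url)

    # Fixed pause, as in the question, to let the page finish loading
    # and to stay polite towards the host.
    if delay_seconds > 0:
        time.sleep(delay_seconds)

    # The HTML lives on the driver, not on the result of get().
    return driver.page_source
```

With the question's variables this would be called as html = fetch_html(browser, page, delay_seconds=10), and html can then be passed straight to BS(html, "lxml").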

Web scraping in Python: BS, Selenium, and the None error