Question

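I can usually write a script that works for scraping, but I'm having some difficulty scraping this site for the table I've been asked to collect for a research project I'm working on. I plan to verify that the script works on one state before entering the URLs of my target states.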

import requests
import bs4 as bs

url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing `table` just to confirm that the table information I'm looking for is within this section

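I'm not sure whether the website is trying to prevent people from scraping, but all of the information I want to grab sits within "&quot;" if you look at what table outputs.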
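For context, `&quot;` is the HTML entity for a double quote, so its presence usually means the data is embedded as escaped text rather than being deliberately hidden. The following is a minimal sketch, assuming the escaped text happens to be JSON; the fragment, its field names, and its values are hypothetical stand-ins, since the actual output is not shown here.

```python
import html
import json

# Hypothetical fragment: a made-up stand-in for escaped text found in the printed markup.
raw_fragment = "{&quot;name&quot;: &quot;Net Metering&quot;, &quot;id&quot;: 284}"

# html.unescape turns entities such as &quot; back into literal characters.
decoded = html.unescape(raw_fragment)

# Assumption: the decoded text is valid JSON. If it is not, json.loads raises ValueError.
record = json.loads(decoded)
print(record["name"])
```

If the decoded text is not JSON, the unescaped string can still be inspected directly to see what structure it has.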

Answer 1

The text is rendered with JavaScript. First, render the page with dryscrape.

(If you don't want to use dryscrape, see Web-scraping JavaScript page with Python.)

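Once the page has been rendered, the text can be extracted from a different position on the page, namely the place it was rendered to.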

As an example, this code extracts the HTML from the summary.

import bs4 as bs
import dryscrape

url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])

Output:

<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...

Answer 2

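I finally managed to solve this and successfully get the data from the JavaScript page. If anyone runs into the same problem when trying to scrape JavaScript pages with Python on Windows (dryscrape is not compatible), the following code worked for me.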

import bs4 as bs
from selenium import webdriver

# Let a real browser load the page so that its JavaScript runs
browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source  # HTML as the browser currently sees it
browser.quit()

# Parse the browser's HTML and collect the text of each bound field
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)
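The strings collected this way may carry stray newlines and indentation from the markup. A small clean-up pass could look like the sketch below; `clean_texts` is a hypothetical helper name and the sample strings are made up for illustration, not taken from the site.

```python
def clean_texts(texts):
    """Collapse runs of whitespace and drop entries that end up empty."""
    cleaned = []
    for text in texts:
        # str.split() with no arguments splits on any whitespace run
        normalized = " ".join(text.split())
        if normalized:
            cleaned.append(normalized)
    return cleaned

# Made-up sample input standing in for the scraped `data` list
sample = ["  Field one:\n  value one ", "\n\t", "Field two:   value two"]
print(clean_texts(sample))
```

With the scraped list, this would be called as `clean_texts(data)`.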

Trouble scraping a website with BS4
