I've been learning how to scrape greatschools.org with BeautifulSoup. Despite looking for different solutions here and elsewhere, I'm stuck. Using Chrome's "Inspect" feature I can see that the site has table tags, but find_all('tr'), find_all('table'), and find_all('tbody') all return an empty list. What am I missing?
Here is the code block I'm using:
import requests
from bs4 import BeautifulSoup
url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
content = BeautifulSoup(page_response.text,"html.parser")
table=content.find_all('table')
table
The output is: []
Thanks in advance for your help.
Answer 0: (score: 3)
You can use Selenium, since the page appears to be dynamic. You can still use BeautifulSoup for parsing if you prefer. Since the tag in question is a table, I chose to use pandas to read the HTML. You'll need to do a little extra work to split the text into columns (and to clean up the first column).
Let me know if this works for you.
import pandas as pd
from selenium import webdriver
url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')  # raw string so backslashes aren't treated as escapes
driver.get(url)
html = driver.page_source
table = pd.read_html(html)
df = table[0]
driver.close()
Output:
print(table[0])
School ... District
0 9/10Above averageSouthern Lehigh Intermediate ... ... Southern Lehigh School District
1 8/10Above averageHanover El School3890 Jackson... ... Bethlehem Area School District
2 8/10Above averageLehigh Valley Charter High Sc... ... Lehigh Valley Charter High School For The Arts
3 6/10AverageCalypso El School1021 Calypso Ave, ... ... Bethlehem Area School District
4 6/10AverageMiller Heights El School3605 Allen ... ... Bethlehem Area School District
5 6/10AverageAsa Packer El School1650 Kenwood Dr... ... Bethlehem Area School District
6 6/10AverageLehigh Valley Academy Regional Cs15... ... Lehigh Valley Academy Regional Cs
7 5/10AverageNortheast Middle School1170 Fernwoo... ... Bethlehem Area School District
8 5/10AverageNitschmann Middle School1002 West U... ... Bethlehem Area School District
9 5/10AverageThomas Jefferson El School404 East ... ... Bethlehem Area School District
10 4/10Below averageJames Buchanan El School1621 ... ... Bethlehem Area School District
11 4/10Below averageLincoln El School1260 Gresham... ... Bethlehem Area School District
12 4/10Below averageGovernor Wolf El School1920 B... ... Bethlehem Area School District
13 4/10Below averageSpring Garden El School901 No... ... Bethlehem Area School District
14 4/10Below averageClearview El School2121 Abing... ... Bethlehem Area School District
15 4/10Below averageLiberty High School1115 Linde... ... Bethlehem Area School District
16 4/10Below averageEast Hills Middle School2005 ... ... Bethlehem Area School District
17 4/10Below averageFreedom High School3149 Chest... ... Bethlehem Area School District
18 3/10Below averageMarvine El School1425 Livings... ... Bethlehem Area School District
19 3/10Below averageWilliam Penn El School1002 Ma... ... Bethlehem Area School District
20 3/10Below averageLehigh Valley Dual Language C... ... Lehigh Valley Dual Language Charter School
21 2/10Below averageBroughal Middle School114 Wes... ... Bethlehem Area School District
22 2/10Below averageDonegan El School1210 East 4t... ... Bethlehem Area School District
23 2/10Below averageFountain Hill El School1330 C... ... Bethlehem Area School District
24 Currently unratedSt. Anne School375 Hickory St... ... NaN
[25 rows x 7 columns]
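The splitting work mentioned above can be done with a regular expression. A minimal sketch on a single sample string in the same "rating + label + name/address" format as the output; the pattern, group names, and sample value are assumptions for illustration, not part of the original answer:

```python
import re

# Assumed structure of the combined first column: a rating ("9/10" or
# "Currently unrated"), an optional label, then the school name/address.
ROW_PATTERN = re.compile(
    r'^(?P<rating>\d+/10|Currently unrated)'
    r'(?P<label>Above average|Below average|Average)?'
    r'(?P<rest>.*)$'
)

sample = "9/10Above averageSouthern Lehigh Intermediate School"
m = ROW_PATTERN.match(sample)
print(m.group('rating'))  # 9/10
print(m.group('label'))   # Above average
print(m.group('rest'))    # Southern Lehigh Intermediate School
```

The same pattern could then be applied column-wide with something like `df['School'].str.extract(ROW_PATTERN)`, though the exact column contents on the live page may differ.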
Now, if you still want to use BeautifulSoup (perhaps because you also want to pull out some of the links or other tags inside the table, and the table alone isn't enough for what you're trying to do), you can keep using bs4 as usual once you have the page source.
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')  # raw string so backslashes aren't treated as escapes
driver.get(url)
page_response = driver.page_source
content = BeautifulSoup(page_response,'html.parser')
table=content.find_all('table')
table
driver.close()
Answer 1: (score: 2)
The table is generated by JavaScript, but the page source contains the table's JSON data.
To get the data, you can use BeautifulSoup and json:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
content = BeautifulSoup(page_response.text, "html.parser")
scripts = content.find_all('script')

jsonObj = None
for script in scripts:
    if 'gon.search' in script.text:
        # The JSON sits between "gon.search=" and the next ";"
        jsonStr = script.text.split('gon.search=')[1].split(';')
        jsonObj = json.loads(jsonStr[0])

for school in jsonObj['schools']:
    print(school['name'])
Or use re and json:
import json
import re
import requests

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
jsonStr = re.search(r'gon\.search=(.*?);', page_response.text).group(1)
jsonObj = json.loads(jsonStr)

for school in jsonObj['schools']:
    print(school['name'])