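I am trying to scrape information from the following page: https://www.tmea.org/programs/all-state/history
I want to select a few options from the first drop-down menu and then use Beautiful Soup to extract the information I need. First, I tried using Beautiful Soup to list the different options: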
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.tmea.org/programs/all-state/history')
soup = BeautifulSoup(page.text, 'html.parser')
body = soup.find(id = 'organization')
options = body.find_all('option')
for name in options:
    child = name.contents[0]
    print(child)
That lists the different options, but I want to be able to submit a specific option and pull back the matching results. I tried adding:
payload = {'organization': '2018 Treble Choir'}
r = requests.post('https://www.tmea.org/programs/all-state/history', data = payload)
print(r.text)
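I have used this approach before on other pages that use POST, and I don't quite understand why this case is different. Does the drop-down mean I have to use something like Selenium? I have used it before, but I'm not sure how to combine it with Beautiful Soup.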
Answer 0 (score: 2)
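1) I don't see POST being used under XHR and Fetch (see the edit below).
2) Yes, you can use Selenium for this. Just use Selenium as you normally would to get to the table. Once the table is rendered, you can feed the page source to BeautifulSoup. For example: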
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.tmea.org/programs/all-state/history'
driver = webdriver.Chrome()
driver.get(url)
# Your code to find/select the drop down menu and select 2018 Treble Choir
...
...
#Once that page is rendered...
soup = BeautifulSoup(driver.page_source, 'html.parser')
To be honest, I wouldn't bother with BeautifulSoup for this, since the data looks like it sits in a <table> tag. Let pandas do the work:
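import io

import pandas as pd
from selenium import webdriver

url = 'https://www.tmea.org/programs/all-state/history'
driver = webdriver.Chrome()
driver.get(url)
# Your code to find/select the drop down menu and select 2018 Treble Choir
...
...
# Once that page is rendered...
# read_html returns a list of DataFrames, one per <table> it finds
tables = pd.read_html(io.StringIO(driver.page_source))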
Edit
I found the POST request under Doc instead. You need to add a few more parameters to the payload:
import io

import pandas as pd
import requests

# Send the other form fields along with 'organization'
payload = {
    'organization': '2018 Treble Choir',
    'instrument': 'All',
    'school_op': 'eq',
    'school': '',
    'city_op': 'eq',
    'city': '',
    's': '',
    'submit': 'Search',
}
r = requests.post('https://www.tmea.org/programs/all-state/history', data=payload)
print(r.text)
# Wrap the HTML in StringIO; recent pandas versions deprecate passing a raw string
tables = pd.read_html(io.StringIO(r.text))
table = tables[0]
Output:
print (table)
0 ... 4
0 Year - Organization ... City
1 NaN ... NaN
2 2018 Treble Choir ... El Paso
3 2018 Treble Choir ... Flower Mound
4 2018 Treble Choir ... Helotes
5 2018 Treble Choir ... Canyon
6 2018 Treble Choir ... Mission
7 2018 Treble Choir ... Belton
8 2018 Treble Choir ... Mansfield
9 2018 Treble Choir ... Wylie
10 2018 Treble Choir ... El Paso
11 2018 Treble Choir ... San Antonio
12 2018 Treble Choir ... Beeville
13 2018 Treble Choir ... Grand Prairie
14 2018 Treble Choir ... San Antonio
15 2018 Treble Choir ... Brownsville
16 2018 Treble Choir ... Houston
17 2018 Treble Choir ... Woodway
18 2018 Treble Choir ... Katy
19 2018 Treble Choir ... Canyon
20 2018 Treble Choir ... Crowley
21 2018 Treble Choir ... Trophy Club
22 2018 Treble Choir ... Amarillo
23 2018 Treble Choir ... Deer Park
24 2018 Treble Choir ... Dallas
25 2018 Treble Choir ... Brownsville
26 2018 Treble Choir ... Houston
27 2018 Treble Choir ... Carrollton
28 2018 Treble Choir ... Plano
29 2018 Treble Choir ... Helotes
.. ... ... ...
140 2018 Treble Choir ... Austin
141 2018 Treble Choir ... Hurst
142 2018 Treble Choir ... League City
143 2018 Treble Choir ... Odessa
144 2018 Treble Choir ... Heath
145 2018 Treble Choir ... Cedar Park
146 2018 Treble Choir ... Jersey Village
147 2018 Treble Choir ... Harlingen
148 2018 Treble Choir ... Grand Prairie
149 2018 Treble Choir ... Coppell
150 2018 Treble Choir ... Lubbock
151 2018 Treble Choir ... The Woodlands
152 2018 Treble Choir ... Laredo
153 2018 Treble Choir ... Sachse
154 2018 Treble Choir ... Pearland
155 2018 Treble Choir ... San Antonio
156 2018 Treble Choir ... Conroe
157 2018 Treble Choir ... Dallas
158 2018 Treble Choir ... Arlington
159 2018 Treble Choir ... Pearland
160 2018 Treble Choir ... Klein
161 2018 Treble Choir ... Houston
162 2018 Treble Choir ... Keller
163 2018 Treble Choir ... Houston
164 2018 Treble Choir ... Fort Worth
165 2018 Treble Choir ... Humble
166 2018 Treble Choir ... Deer Park
167 2018 Treble Choir ... Houston
168 2018 Treble Choir ... Magnolia
169 2018 Treble Choir ... Katy
[170 rows x 5 columns]
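In the output above, row 0 holds the column titles ("Year - Organization", ..., "City") and row 1 is all NaN, so the frame still carries numeric column labels. One way to tidy that up is sketched below. It is a minimal example on a tiny stand-in frame that reuses two columns and the first rows from the printed output; `promote_header` is a hypothetical helper name, not something pandas provides.

```python
import pandas as pd


def promote_header(table):
    """Return a copy whose first row becomes the column names, minus all-NaN rows."""
    cleaned = table.copy()
    # Take the first row's values as the new column labels
    cleaned.columns = cleaned.iloc[0].tolist()
    # Drop that row, remove rows that are empty in every column, renumber from 0
    cleaned = cleaned.iloc[1:].dropna(how='all').reset_index(drop=True)
    return cleaned


# Stand-in for tables[0]: two of the five columns, first few rows only
raw = pd.DataFrame([
    ['Year - Organization', 'City'],
    [None, None],
    ['2018 Treble Choir', 'El Paso'],
    ['2018 Treble Choir', 'Flower Mound'],
])
clean = promote_header(raw)
print(clean)
```

On the real data you would call `promote_header(table)` with the `table` from the answer. Passing `header=0` to `pd.read_html` may get you part of the way as well, though the empty row would still need dropping.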