I'm a bit lost about where to go with this and thought you might be able to help. I'm trying to automatically scrape data from the following search form:
http://www.pro-football-reference.com/play-index/play_finder.cgi
To do this, I run BeautifulSoup on a string containing the request URL of interest, like so:
url = "http://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&super_bowl=0&match=all&year_min=2015 &year_max=2015 &game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL&"
BeautifulSoup(url)
I'm sure the string is correct, because I've pasted it into a browser many times and seen the table I need (here, the table with id "all_plays"). But when I run BeautifulSoup on the URL, all I get back is the string wrapped in tags:
<.html><.body><.p>http://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&super_bowl=0&match=all&year_min=2015&year_max=2015&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL<./p><./body><./html>
(I added a period inside each HTML tag so the tags BeautifulSoup appended don't get rendered as formatting.)
I assume this has something to do with the request to the remote CGI, but I haven't found an answer. Is there a way to grab the table using BeautifulSoup? Otherwise I'll switch to a Selenium driver and automate it that way, but that's a last resort.
Thanks!
Answer 0 (score: 1)
First of all, you are currently asking BeautifulSoup to parse your URL as a string. BeautifulSoup is an HTML parser, not an HTTP request library; it will not make a request to the URL in this case.
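A minimal sketch of what is actually happening (the example.com URL here is arbitrary): BeautifulSoup treats whatever string it is handed as markup, so a URL just becomes a plain-text "document".

```python
from bs4 import BeautifulSoup

# BeautifulSoup treats its first argument as markup, not as an address to fetch.
url = "http://example.com/page?x=1"
soup = BeautifulSoup(url, "html.parser")

# No HTTP request happens; the parsed "document" is just the URL string itself.
print(soup.get_text())  # → http://example.com/page?x=1
```

Some parsers (e.g. lxml) additionally wrap the bare string in `<html><body><p>` tags, which is exactly the output shown in the question.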
Instead, use the requests package to send the request, and use BeautifulSoup to parse the response:
import requests
from bs4 import BeautifulSoup

url = 'http://www.pro-football-reference.com/play-index/play_finder.cgi'

with requests.Session() as session:
    # Present a regular browser User-Agent
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36"}
    session.get(url)  # visit the form page first to pick up cookies

    # Note: the stray spaces after "year_min=2015" and "year_max=2015" in the
    # original string have been removed here
    response = session.get(url + "?request=1&super_bowl=0&match=all&year_min=2015&year_max=2015&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL&")

    soup = BeautifulSoup(response.content, "html.parser")
    for row in soup.select("#div_down table tr")[1:]:
        print([cell.get_text() for cell in row.find_all("td")])
This prints the data rows of the first "Down" table:
['1', '100.0%']
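As a side note, rather than concatenating that long query string by hand, the parameters can be passed to requests as a dict via the `params` argument; repeated keys such as `turnover_type` become lists. A sketch with a subset of the question's parameters (building the request without sending it, to show the resulting URL):

```python
import requests

url = "http://www.pro-football-reference.com/play-index/play_finder.cgi"
params = {
    "request": 1,
    "year_min": 2015,
    "year_max": 2015,
    "type": "RUSH",
    # a list value produces repeated keys: turnover_type=interception&turnover_type=fumble
    "turnover_type": ["interception", "fumble"],
}

# prepare() encodes the query string without making a network request
req = requests.Request("GET", url, params=params).prepare()
print(req.url)
```

This also takes care of URL-encoding, so accidental stray spaces like the ones in the original string cannot slip in.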