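Using BeautifulSoup to grab CGI information

Time: 2016-09-19 22:21:20

Tags: python-2.7 beautifulsoup cgi

I'm a bit lost about where to go with this and thought you might be able to help. I'm trying to automate scraping data from the following search form: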

http://www.pro-football-reference.com/play-index/play_finder.cgi

To do this, I'm running BeautifulSoup on a string that matches the request URL I'm interested in, like so:

url = "http://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&super_bowl=0&match=all&year_min=2015&year_max=2015&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL&"
BeautifulSoup(url)

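I'm sure the string is correct, because I've pasted it into a browser many times and seen the table I need (here, the individual plays table with id="all_plays"). But when I run BeautifulSoup on the URL, it misses the mark: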

<.html><.body><.p>http://www.pro-football-reference.com/play-index/play_finder.cgi?request=1&super_bowl=0&match=all&year_min=2015&year_max=2015&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL<./p><./body><./html>

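(I added a period to each HTML tag so that the markup BeautifulSoup added wouldn't get rendered as formatting here.)

I assume this has something to do with the request going to a remote CGI script, but I don't have any answers.

Is there a way to use BeautifulSoup to grab the table? Otherwise I'll turn to the Selenium driver and try to automate it that way, but that's a last resort.

Thanks!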

1 answer:

Answer 0: (score: 1)

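First, you are currently asking BeautifulSoup to parse your URL as a string. BeautifulSoup is an HTML parser, not an HTTP request library, so it will not make a request to the URL in this case.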
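To see this for yourself, here is a minimal sketch. The short example.com URL is a hypothetical stand-in for the long play_finder.cgi query, and I'm assuming the built-in "html.parser" backend (with lxml you get the extra html/body/p wrapper shown in the question). Some bs4 versions also emit a warning that the input looks like a URL rather than markup, which is silenced here.

```python
import warnings
from bs4 import BeautifulSoup

# Hypothetical short URL standing in for the long play_finder.cgi query.
url = "http://example.com/search.cgi?request=1"

with warnings.catch_warnings():
    # Some bs4 versions warn that the input looks like a URL, not markup.
    warnings.simplefilter("ignore")
    soup = BeautifulSoup(url, "html.parser")

page_text = soup.get_text()   # the URL itself, treated as plain text
table = soup.find("table")    # None: nothing was downloaded, so no table
print(repr(page_text), table)
```

The "document" is just the URL text, which is why no table ever shows up.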

Instead, use the requests package to send the request, and parse the response with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'http://www.pro-football-reference.com/play-index/play_finder.cgi'

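with requests.Session() as session:
    # Send a browser-like User-Agent with every request in this session
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36"}
    # Visit the search form first, so the session keeps any cookies the site sets
    session.get(url)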

    response = session.get(url + "?request=1&super_bowl=0&match=all&year_min=2015&year_max=2015&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&quarter=1&tr_gtlt=lt&minutes=15&seconds=00&down=1&yg_gtlt=eq&yards=-5&is_first_down=-1&field_pos_min_field=team&field_pos_max_field=team&end_field_pos_min_field=team&end_field_pos_max_field=team&type=RUSH&is_complete=-1&is_turnover=-1&turnover_type=interception&turnover_type=fumble&is_scoring=-1&score_type=touchdown&score_type=field_goal&score_type=safety&is_sack=-1&include_kneels=-1&no_play=0&order_by=yards&more_options=1&rush_direction=LT&pass_location=DL&")

    soup = BeautifulSoup(response.content, "html.parser")
    for row in soup.select("#div_down table tr")[1:]:
        print([cell.get_text() for cell in row.find_all("td")])

This prints the data rows of the first "Down" table:

['1', '100.0%']
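As a side note, a hand-typed query string that long is easy to get wrong (a stray space is enough). One possible alternative is to build it from a list of key/value pairs with the standard library. This is only a sketch: it assumes Python 3 (in Python 2.7, urlencode lives in the urllib module instead), and params holds just an illustrative subset of the fields from the question, so the site may well need the full set.

```python
from urllib.parse import urlencode, parse_qs

base_url = "http://www.pro-football-reference.com/play-index/play_finder.cgi"

# Illustrative subset of the question's fields. A list of pairs (rather
# than a dict) keeps repeated keys such as turnover_type.
params = [
    ("request", 1),
    ("year_min", 2015),
    ("year_max", 2015),
    ("game_type", "R"),
    ("quarter", 1),
    ("down", 1),
    ("yards", -5),
    ("type", "RUSH"),
    ("turnover_type", "interception"),
    ("turnover_type", "fumble"),
    ("order_by", "yards"),
]

query = urlencode(params)            # percent-encodes values, joins with "&"
full_url = base_url + "?" + query
print(full_url)
```

requests can also take the same list directly, as in session.get(url, params=params), and will do the encoding for you.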