Parsing a bs4 table from Baseball-Reference

Date: 2018-05-04 03:11:31

Tags: beautifulsoup

import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml'

def make_soup(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

soup = make_soup(url)

I'm trying to locate the play-by-play table on that page, and I've exhausted every option I can think of. Any ideas on how to find it?

It's a tbody located under div.table_outer_container.mobile_table.

2 Answers:

Answer 0 (score: 0)

You can use Selenium together with BeautifulSoup to scrape that table's contents, like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml")

html = driver.page_source
soup = BeautifulSoup(html, "lxml")

pbp_table = soup.find_all("table", {"id":"play_by_play"})

for tag in pbp_table:
    print(tag.text)

If you want to use this code, be sure to check the Selenium guide on drivers, and download the latest geckodriver if you're using Firefox as in the code above.

Answer 1 (score: 0)

It's commented out in the page source:


Search for content that identifies the comment, i.e. the play_by_play id:

from requests import get
from bs4 import BeautifulSoup, Comment


cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml").content
soup = BeautifulSoup(cont, "lxml")

# Search Comments
comment = soup.find(text=lambda n: isinstance(n, Comment) and 'id="play_by_play"' in n)

soup2 = BeautifulSoup(comment, "lxml")
table = soup2.select("#play_by_play")[0]
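The comment-extraction trick can be verified offline on a minimal snippet; the HTML below is a made-up stand-in for the real page, not actual Baseball-Reference markup:

```python
from bs4 import BeautifulSoup, Comment

# A tiny stand-in page: the table is hidden inside an HTML comment,
# just like the play-by-play table on Baseball-Reference.
html = """
<div class="table_outer_container">
<!--
<table id="play_by_play"><tr><td>Top of the 1st</td></tr></table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain search fails: the table only exists inside the comment node.
assert soup.find("table", id="play_by_play") is None

# Find the comment whose text contains the table, then re-parse that text.
comment = soup.find(text=lambda n: isinstance(n, Comment) and 'id="play_by_play"' in n)
table = BeautifulSoup(comment, "html.parser").select_one("#play_by_play")
print(table.td.text)  # Top of the 1st
```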

Which gets you what you're after:

In [3]: from requests import get
   ...: from bs4 import BeautifulSoup, Comment
   ...: cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml").content
   ...: soup = BeautifulSoup(cont, "lxml")
   ...: comment = soup.find(text=lambda n: isinstance(n, Comment) and 'id="play_by_play"' in n)
   ...: soup2 = BeautifulSoup(comment, "lxml")
   ...: table = soup2.select("#play_by_play")[0]
   ...: print(table.select_one(".pbp_summary_top").text)
   ...:
Top of the 1st, Braves Batting, Tied 0-0, Mets' Noah Syndergaard facing 1-2-3

You can also use a regex with text=...:

from re import compile
from requests import get
from bs4 import BeautifulSoup

cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml").content
soup = BeautifulSoup(cont, "lxml")
comment = soup.find(text=compile('id="play_by_play"'))
soup2 = BeautifulSoup(comment, "lxml")
table = soup2.select("#play_by_play")[0]
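Once the table is recovered, its rows can be flattened into plain Python lists. A sketch on the same kind of made-up fragment (the cell values are invented for illustration; the real table has many more columns):

```python
from bs4 import BeautifulSoup

# Hypothetical play-by-play fragment, not real Baseball-Reference data.
html = """
<table id="play_by_play">
  <thead><tr><th>Inn</th><th>Play</th></tr></thead>
  <tbody>
    <tr><td>t1</td><td>Strikeout</td></tr>
    <tr><td>t1</td><td>Single</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").select_one("#play_by_play")

# Header text from the thead, cell text row by row from the tbody.
headers = [th.text for th in table.select("thead th")]
rows = [[td.text for td in tr.select("td")] for tr in table.select("tbody tr")]

print(headers)  # ['Inn', 'Play']
print(rows)     # [['t1', 'Strikeout'], ['t1', 'Single']]
```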