url = 'https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml'
def make_soup(url):
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
return soup
soup = make_soup(url)
我正试图通过该页面上的播放表找到该剧,我已经用尽了所有选项。有关如何定位的任何想法?
这是位于div.table_outer_container.mobile_table
下的 tbody答案 0 :(得分:0)
您可以将Selenium与BeautifulSoup结合使用来抓取该表内容,如下所示:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml")
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
pbp_table = soup.find_all("table", {"id":"play_by_play"})
for tag in pbp_table:
print (tag.text)
如果您想使用此代码,请务必查看Selenium guide on drivers并下载最新的geckodriver,如果您使用的是上述代码中的Firefox。
答案 1 :(得分:0)
它在来源中被注释掉了:
查找标识评论的内容,即play_by_play
id
from requests import get
from bs4 import BeautifulSoup, Comment
cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml").content
soup = BeautifulSoup(cont, "lxml")
# Search Comments
comment = soup.find(text=lambda n: isinstance(n, Comment) and 'id="play_by_play"' in n)
soup2 = BeautifulSoup(comment)
table = soup2.select("#play_by_play")[0]
哪个得到了你想要的东西:
In [3]: from requests import get
...: from bs4 import BeautifulSoup, Comment
...: cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.sh
...: tml").content
...: soup = BeautifulSoup(cont, "lxml")
...: comment = soup.find(text=lambda n: isinstance(n, Comment) and 'id="pla
...: y_by_play"' in n)
...: soup2 = BeautifulSoup(comment, "lxml")
...: table = soup2.select("#play_by_play")[0]
...: print(table.select_one(".pbp_summary_top").text)
...:
Top of the 1st, Braves Batting, Tied 0-0, Mets' Noah Syndergaard facing 1-2-3
In [4]:
您还可以使用text=...
的正则表达式:
cont = get("https://www.baseball-reference.com/boxes/NYN/NYN201704030.shtml").content
soup = BeautifulSoup(cont, "lxml")
comment = soup.find(text=compile('id="play_by_play"'))
soup2 = BeautifulSoup(comment, "lxml")
table = soup2.select("#play_by_play")[0]