How do I loop through all the <th> tags in my script for web scraping?

Asked: 2019-12-29 17:56:19

Tags: python python-3.x debugging web-scraping beautifulsoup

As of right now, I'm only getting ['1'] as the output of what my current code prints. I want to grab the values 1-54 from the Rk column of the "Team Batting" table at https://www.baseball-reference.com/teams/NYY/2019.shtml.

How would I modify colNum so that it prints 1-54 from the Rk column? I'm pointing at the colNum line because I suspect the problem is there, but I could be wrong.

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')

items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.

tbody = week.find("tbody")
tr = tbody.find("tr")

thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]  # iterates over the characters of the single string thtwo
print(colNum)

1 Answer:

Answer 0 (score: 1)

The last few lines you pointed at are indeed where the error is. `tr = tbody.find("tr")` returns only the first row, so `thtwo` is the single string "1", and your list comprehension then iterates over the characters of that string, which is why you only ever see ['1']. If I understand correctly, you want a list of every value in the "Rk" column. To get all of the rows you have to use the find_all() function. I adjusted your code slightly to grab the text of the first <th> in each row:

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse the page source as HTML
week = soup.find(class_='table_outer_container')

items = week.find("thead").get_text()
th = week.find("th").get_text()

tbody = week.find("tbody")
tr = tbody.find_all("tr")  # every row in the table body, not just the first
colnum = [row.find("th").get_text() for row in tr]  # Rk value from each row

print(colnum)
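You can verify the find_all() pattern without hitting the live site. Here is a minimal sketch using a stand-in HTML snippet; the table markup below is an assumption modeled on Baseball-Reference's layout (first <th> of each body row holds the Rk value), not the site's actual source:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the "Team Batting" table.
html = """
<table><tbody>
  <tr><th>1</th><td>LeMahieu</td></tr>
  <tr><th>2</th><td>Judge</td></tr>
  <tr><th>3</th><td>Sanchez</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody").find_all("tr")  # all rows, not just the first
rk_values = [row.find("th").get_text() for row in rows]
print(rk_values)  # ['1', '2', '3']
```

Incidentally, since your script already imports pandas, pandas.read_html() can often pull a whole table into a DataFrame in one call; just be aware that Baseball-Reference wraps some of its tables inside HTML comments, which read_html will not see without extra preprocessing.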