I am trying to scrape data points from one web page (A), then scrape each individual data point's own web page, and combine all of the data into a single dataframe for easy viewing.
This is a daily dataframe with four columns: Team, Pitcher, ERA, WHIP. ERA and WHIP live at each specific pitcher's URL. So far I have managed to scrape the team names and the starting pitcher names and organize both into a dataframe (albeit with errors).
I want to add code that follows each pitcher's web page, scrapes their ERA and WHIP, and then merges that data into the same dataframe as the team and pitcher names. Is this possible?
What I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)

soup = BeautifulSoup(targetUrl_response.content, "html.parser")

teams = []
pitchers = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                pitchers.append(link.text)

df = pd.DataFrame(list(zip(teams, pitchers)), columns=['Team', 'Pitcher'])
print(df)
Answer (score: 0)
First things first (see what I did there :-)): sports-reference.com pages are dynamic. You can pull certain tables straight away, but if there are multiple tables, you'll find them under comment tags within the HTML source. So that could be an issue later on if you want to pull more data from the page.
The second thing I noticed is that you're pulling <tr> tags, which means there are <table> tags, and pandas can do the heavy lifting for you instead of iterating through with bs4, using its handy pd.read_html() function. However, it won't pull those links, just strictly the text. So in this case, iterating with BeautifulSoup is the way to go (I'm just mentioning it for future reference).
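As a minimal sketch of the comment-tag issue mentioned above: tables hidden inside HTML comments are invisible to a normal soup.find_all('table'), but you can extract the comment nodes and re-parse them. This uses a toy HTML string; the table id and layout here are illustrative, not the actual baseball-reference markup.

```python
from io import StringIO

from bs4 import BeautifulSoup, Comment
import pandas as pd

# Toy page: the stats table is wrapped in an HTML comment, as
# sports-reference pages often do for secondary tables.
html = """
<div>
  <!--
  <table id="pitching_standard">
    <tr><th>Year</th><th>ERA</th></tr>
    <tr><td>2019</td><td>3.54</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Comment nodes are not regular tags, so pull them out explicitly
# and hand any that contain a <table> to pandas.
tables = []
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    if '<table' in comment:
        tables.extend(pd.read_html(StringIO(str(comment))))

df = tables[0]
print(df)
```

The same loop works on a real page by building the soup from `requests.get(...).content` instead of the literal string.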
There's still some work to be done, as a few of these didn't have links / didn't return an ERA or WHIP. You'll also have to consider that if a player was traded or changed leagues, there could be multiple ERA rows for the same 2019 season. But this should get you going:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

targetUrl = 'http://www.baseball-reference.com/previews/'
targetUrl_response = requests.get(targetUrl, timeout=5)

soup = BeautifulSoup(targetUrl_response.content, "html.parser")

teams = []
pitchers = []
era_list = []
whip_list = []
for i in soup.find_all('tr'):
    if i.find_all('strong'):
        for link in i.find_all('strong'):
            if not re.findall(r'MLB Debut', link.text):
                teams.append(link.text)
    if i.find_all('a'):
        for link in i.find_all('a'):
            if not re.findall(r'Preview', link.text):
                try:
                    # Follow the pitcher's own page and let pandas parse the stats table
                    url_link = link['href']
                    pitcher_table = pd.read_html(url_link)[0]
                    pitcher_table = pitcher_table[(pitcher_table['Year'] == '2019') & (pitcher_table['Lg'].isin(['AL', 'NL']))]
                    era = round(pitcher_table.iloc[0]['ERA'], 2)
                    whip = round(pitcher_table.iloc[0]['WHIP'], 2)
                except Exception:
                    # No link, or no 2019 AL/NL row to pull
                    era = 'N/A'
                    whip = 'N/A'
                pitchers.append(link.text)
                era_list.append(era)
                whip_list.append(whip)
                print('%s\tERA: %s\tWHIP: %s' % (link.text, era, whip))

df = pd.DataFrame(list(zip(pitchers, teams, era_list, whip_list)), columns=['Pitcher', 'Team', 'ERA', 'WHIP'])
print(df)
Output:
print (df)
Pitcher Team ERA WHIP
0 Walker Lockett NYM 23.14 2.57
1 Jake Arrieta PHI 4.12 1.38
2 Logan Allen SDP 0 0.71
3 Jimmy Yacabonis BAL 4.7 1.44
4 Clayton Richard TOR 7.46 1.74
5 Glenn Sparkman KCR 3.62 1.25
6 Shane Bieber CLE 3.86 1.08
7 Carson Fulmer CHW 6.35 1.94
8 David Price BOS 3.39 1.1
9 Jesse Chavez TEX N/A N/A
10 Jordan Zimmermann DET 6.03 1.37
11 Max Scherzer WSN 2.62 1.06
12 Trevor Richards MIA 3.54 1.25
13 Max Fried ATL 4.03 1.34
14 Adbert Alzolay CHC 2.25 0.75
15 Marco Gonzales SEA 4.38 1.37
16 Zach Davies MIL 3.06 1.36
17 Trevor Williams PIT 4.12 1.19
18 Gerrit Cole HOU 3.54 1.02
19 Blake Snell TBR 4.4 1.24
20 Kyle Gibson MIN 4.18 1.25
21 Chris Bassitt OAK 3.64 1.17
22 Jack Flaherty STL 4.24 1.18
23 Ross Stripling LAD 3.08 1.17
24 Robbie Ray ARI 3.87 1.34
25 Chi Chi Gonzalez COL N/A N/A
26 Madison Bumgarner SFG 4.28 1.24
27 Tyler Mahle CIN 4.17 1.2
28 Andrew Heaney LAA 5.68 1.14
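On the traded-player caveat mentioned earlier: rather than taking the first 2019 row, one option is to recombine the stints from the underlying counting stats. A toy sketch, using made-up numbers and column names that mirror the standard pitching table (ERA = 9·ER/IP, WHIP = (H+BB)/IP):

```python
import pandas as pd

# Hypothetical pitcher traded mid-2019: two stints, one per league.
# Real baseball-reference IP values use thirds (e.g. 60.1 = 60 1/3),
# so whole numbers are used here to keep the arithmetic clean.
stints = pd.DataFrame({
    'Year': ['2019', '2019'],
    'Lg':   ['AL', 'NL'],
    'IP':   [60.0, 40.0],
    'ER':   [30, 12],
    'H':    [55, 35],
    'BB':   [20, 10],
})

season = stints[stints['Year'] == '2019']
ip = season['IP'].sum()

# Season-level rates from the summed counting stats
era = round(9 * season['ER'].sum() / ip, 2)
whip = round((season['H'].sum() + season['BB'].sum()) / ip, 2)
print(era, whip)  # → 3.78 1.2
```

This avoids silently reporting only the first stint's numbers when a pitcher appears in both leagues in the same season.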