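BeautifulSoup web scrape fails to scrape the data from one column of a table.
My script scrapes all the data in the table except the data in the "Player" column; the output shows every player name as None.
The only difference between the td element in the "Player" column and all the other td elements in the tr is that the player td has an href (a link) inside it, as shown in the image below.
How do I change the code to get the player names? Is the href in the "Player" data what is breaking my script? If so, how do I account for it?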
#HOME_SKATERS
#FIRST_TWO_GAMES
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

table = []
df = pd.DataFrame()
for i in range(400959564, 400959565):
    url = requests.get("http://www.espn.com/nhl/boxscore?gameId={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    #Add the game ID to the list of soups to keep track of multiple players with same game ID
    table.append((i, soup.find_all('table', {'class': 'mod-data'})[5].find_all('tr')[2:20]))

data = []
soups = []
game_id = []
for i, t in table:
    #Use .contents method to turn the soup into list of items
    soups = [j.contents for j in t]
    for s in soups:
        #Use .string method to parse the values of different columns
        data.append([a.string for a in s])
        #Append the Game ID
        game_id.append(i)

#Create a DataFrame from the data extracted
df = pd.DataFrame(data)
df.columns = ['Player', 'G', 'A', 'Plus_Minus', 'SOG', 'MS', 'BS', 'PN', 'PIM', 'HT', 'TK', 'GV', 'SHF', 'TOT', 'PP', 'SH', 'EV', 'FW', 'FL', 'Faceoff_Pct']
df['Game ID'] = game_id
#df.to_csv('HOME_SKATERS.csv')
df

Answer 0 (score: 0)
Change:
data.append([a.string for a in s])
to
data.append([a.text for a in s])
Output:
            Player  G  A  Plus_Minus  SOG  MS  BS  PN  PIM  HT  ...  GV  SHF  \
0      J. Armia RW  0  0          -2    0   0   1   0    0   1  ...   0   21
1   D. Byfuglien D  0  1           0    5   1   1   1    2   1  ...   1   29
2        A. Copp C  0  0          -1    1   0   1   0    0   1  ...   0   18
3        M. Dano C  0  0          -1    0   0   0   0    0   0  ...   0   14
4     N. Ehlers LW  0  0          -1    2   1   0   0    0   0  ...   0   20
5     T. Enstrom D  0  0          -2    1   1   2   0    0   1  ...   0   23
6     D. Kulikov D  0  0           1    0   0   0   0    0   1  ...   0   20
7      P. Laine RW  0  1           2    2   1   0   0    0   0  ...   1   23
8      B. Little C  0  1           0    3   2   0   0    0   0  ...   0   22
9      A. Lowry LW  0  0           0    4   1   0   0    0   3  ...   1   27
10   S. Matthias C  0  0          -1    2   0   0   0    0   2  ...   2   23
11  J. Morrissey D  0  0          -2    2   0   2   1    2   0  ...   0   20
12      T. Myers D  0  0           0    2   1   2   1    2   2  ...   1   27
13  M. Perreault C  1  0          -1    2   0   0   0    0   1  ...   0   23
14  M. Scheifele C  1  0          -1    4   0   0   0    0   0  ...   0   23
15     B. Tanev LW  0  0          -1    0   0   0   0    0   1  ...   0   15
16     J. Trouba D  0  0          -4    5   0   4   1    2   4  ...   1   30
17   B. Wheeler RW  0  0          -1    2   2   1   0    0   0  ...   0   23
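With .text, the Player column holds the name and the position together (for example "B. Wheeler RW"). If you want them in separate fields, one possible approach is to read the name from the a tag and treat whatever text is left in the cell as the position. The sketch below is only an illustration under a few assumptions: parse_row is a hypothetical helper, the sample row is hand-built from the player cell quoted later in this answer plus three plain td cells that I assume look like the stat cells, and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-built sample row (assumption): the player cell from this answer
# followed by three plain stat cells.
html = (
    '<table><tr>'
    '<td style="text-align:left;">'
    '<a href="http://www.espn.com/nhl/player/_/id/3961/blake-wheeler">B. Wheeler</a> RW</td>'
    '<td>0</td><td>0</td><td>-1</td>'
    '</tr></table>'
)


def parse_row(tr):
    """Hypothetical helper: return (name, position, stats) for one table row."""
    cells = tr.find_all('td')
    first = cells[0]
    link = first.find('a')
    if link is not None:
        name = link.get_text(strip=True)
        # Text left in the cell after the link text is treated as the position.
        position = first.get_text(strip=True)[len(name):].strip()
    else:
        # No link in the cell: keep the whole text as the name.
        name, position = first.get_text(strip=True), ''
    stats = [td.get_text(strip=True) for td in cells[1:]]
    return name, position, stats


soup = BeautifulSoup(html, 'html.parser')
row = parse_row(soup.find('tr'))
print(row)  # ('B. Wheeler', 'RW', ['0', '0', '-1'])
```

Iterating over find_all('td') instead of .contents also skips any whitespace nodes between cells, which may or may not matter for the real page.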
See: Difference between .string and .text BeautifulSoup
An example:
from bs4 import BeautifulSoup
data = '<td style="text-align:left;"><a href="http://www.espn.com/nhl/player/_/id/3961/blake-wheeler">B. Wheeler</a> RW</td>'
soup = BeautifulSoup(data, 'lxml')
td = soup.find("td")
print (td.string)
print (td.text)
Output:
None
B. Wheeler RW
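This is because the td element contains a tag (the a link) in addition to text. When an element has more than one child, .string is None, while .text joins all the text inside the element.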
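To make the rule concrete, here is a small sketch comparing .string and get_text() (which .text is shorthand for) on three hand-written cells. The markup and the labels are made up for illustration, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Three hand-written cells (assumptions, not taken from the ESPN page).
cases = {
    'text only': '<td>21</td>',
    'single child tag': '<td><a href="#">B. Wheeler</a></td>',
    'tag plus text': '<td><a href="#">B. Wheeler</a> RW</td>',
}

results = {}
for label, html in cases.items():
    td = BeautifulSoup(html, 'html.parser').find('td')
    # .string works only when there is a single child to descend into;
    # get_text() always concatenates every string inside the element.
    results[label] = (td.string, td.get_text())
    print(label, '->', repr(td.string), '|', repr(td.get_text()))

# text only        -> '21'         | '21'
# single child tag -> 'B. Wheeler' | 'B. Wheeler'
# tag plus text    -> None         | 'B. Wheeler RW'
```

Only the third case matches the Player cell in the question, which is why that column alone came back as None.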