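I have downloaded some HTML files that I want to parse. I can parse the files, but now I want to build some lists so that I can draw a scatter plot. I am new to Python, so I am not sure how to get the values into lists.
I tried setting a variable equal to the text taken from the column.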
for y in range(1977, 2020, 1):
    tmp = random.random()*5.0
    print('Sleep for ', tmp, ' seconds')
    time.sleep(tmp)
    url = 'https://www.basketball-reference.com/teams/IND/' + str(y) + '_games.html'
    print('Download from :', url)
    # download
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    fileout = 'YEARS/' + str(y) + '.html'
    print('Save to : ', fileout, '\n')
    # save file to disk
    f = open(fileout, 'w')
    f.write(html.decode('utf-8'))
    f.close()
# parse
for year in range(1977, 2019, 1):
    filein = 'YEARS/' + str(year) + '.html'
    soup = BeautifulSoup(open(filein), 'lxml')
    entries = soup.find_all('tr', attrs={'class': ''})
    for entry in entries:
        # print entry
        columns = entry.find_all('td')
        if len(columns) > 4:
            # print('C0: ', columns[4])
            where = columns[4].get_text()
            # print('C1: ', columns[5])
            opponent = columns[5].get_text()
            # print('C2: ', columns[6])
            WL = columns[6].get_text()
            # print('C3: ', columns[8])
            PacerScore = columns[8].get_text()
            # print('C4: ', columns[9])
            OpponentScore = columns[9].get_text()
            tt = where + '|::|' + opponent + '|::|' + WL + '|::|' + PacerScore + '|::|' + OpponentScore
            print(tt)
            x = PacerScore
            y = OpponentScore
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
I also tried using pandas' read_html, but I messed something up and could not get it to work. It kept telling me that it could not find a function.
# parse
for y in range(1977, 2019, 1):
    filein = 'YEARS/' + str(y) + '.html'
    soup = BeautifulSoup(open(filein), 'r')
    table = BeautifulSoup(open('YEARS/' + str(y) + '.html', 'r').read()).find('table')
    df = pd.read_html(table)
Any suggestions or pointers would be much appreciated.
Answer 0 (score: 0)
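If you are using pandas' .read_html(), you do not need BeautifulSoup to find the table tags; pandas does that for you. You are also doing a lot of extra work by first saving the HTML and then parsing it. Why not parse the HTML directly and then, if you need to, save only the table?
You can then use the table for plotting.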
from io import StringIO

import requests
import pandas as pd
import numpy as np
import time
import random

headers = {'User-Agent': 'Mozilla/5.0'}

for year in range(1977, 2020, 1):
    tmp = random.random()*5.0
    print('Sleep for ', tmp, ' seconds')
    time.sleep(tmp)
    url = 'https://www.basketball-reference.com/teams/IND/' + str(year) + '_games.html'
    # fetch the page with the custom User-Agent, then let pandas parse the response body
    response = requests.get(url, headers=headers)
    tables = pd.read_html(StringIO(response.text))
    table = tables[0]
    # drop the header rows that are repeated inside the table body
    table = table[table['G'] != 'G']
    table = table[['Unnamed: 5', 'Opponent', 'Unnamed: 7', 'Tm', 'Opp']].copy()
    table.columns = ['Where', 'Opponent', 'WL', 'PacerScore', 'OpponentScore']
    table['Where'] = np.where(table.Where == '@', 'Away', 'Home')
    # record the season, matching the Season column in the output shown in this answer
    table['Season'] = year
    print('Download table from :', url)
    table.to_csv('YEARS/' + str(year) + '.csv')
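Your table will look like the output shown in this answer, and you can do the following:
x = table['PacerScore']
y = table['OpponentScore']
to get the x and y values for the scatter plot.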
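The score columns may come back as strings, and games that have not been played yet show up as NaN, so it can help to clean both columns before plotting. Here is a minimal sketch, assuming the column names used in this answer. `scores_to_lists` and `plot_scores` are hypothetical helper names, and the small sample frame reuses three rows from the output shown in this answer instead of a real download.

```python
import pandas as pd


def scores_to_lists(table):
    """Return (x, y) lists of numeric scores, skipping games without a result."""
    scores = table[['PacerScore', 'OpponentScore']].apply(pd.to_numeric, errors='coerce')
    scores = scores.dropna()
    return scores['PacerScore'].tolist(), scores['OpponentScore'].tolist()


def plot_scores(x, y):
    """Draw the scatter plot; matplotlib is imported here so the helper above works without it."""
    import matplotlib.pyplot as plt
    plt.scatter(x, y, alpha=0.5)
    plt.xlabel('PacerScore')
    plt.ylabel('OpponentScore')
    plt.show()


# Three rows taken from the output in this answer; the last game has no score yet.
sample = pd.DataFrame({
    'PacerScore': ['111', '101', None],
    'OpponentScore': ['83', '118', None],
})
x, y = scores_to_lists(sample)
print(x, y)
```

Calling `plot_scores(x, y)` afterwards should draw the plot. The `s=area` and `c=colors` arguments from the question are left out here because those variables are not defined anywhere in the posted code.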
Output:
print(table.to_string())
    Where  Opponent                WL   PacerScore  OpponentScore  Season
0   Home   Memphis Grizzlies       W    111         83             2019
1   Away   Milwaukee Bucks         L    101         118            2019
2   Home   Brooklyn Nets           W    132         112            2019
3   Away   Minnesota Timberwolves  L    91          101            2019
4   Away   San Antonio Spurs       W    116         96             2019
5   Away   Cleveland Cavaliers     W    119         107            2019
6   Home   Portland Trail Blazers  L    93          103            2019
7   Away   New York Knicks         W    107         101            2019
8   Away   Chicago Bulls           W    107         105            2019
9   Home   Boston Celtics          W    102         101            2019
10  Home   Houston Rockets         L    94          98             2019
11  Home   Philadelphia 76ers      L    94          100            2019
12  Away   Miami Heat              W    110         102            2019
13  Away   Houston Rockets         L    103         115            2019
14  Home   Miami Heat              W    99          91             2019
15  Home   Atlanta Hawks           W    97          89             2019
16  Home   Utah Jazz               W    121         94             2019
17  Away   Charlotte Hornets       L    109         127            2019
18  Home   San Antonio Spurs       L    100         111            2019
19  Away   Utah Jazz               W    121         88             2019
21  Away   Phoenix Suns            W    109         104            2019
22  Away   Los Angeles Lakers      L    96          104            2019
23  Away   Sacramento Kings        L    110         111            2019
24  Home   Chicago Bulls           W    96          90             2019
25  Away   Orlando Magic           W    112         90             2019
26  Home   Sacramento Kings        W    107         97             2019
27  Home   Washington Wizards      W    109         101            2019
28  Home   Milwaukee Bucks         W    113         97             2019
29  Away   Philadelphia 76ers      W    113         101            2019
30  Home   New York Knicks         W    110         99             2019
31  Home   Cleveland Cavaliers     L    91          92             2019
32  Away   Toronto Raptors         L    96          99             2019
33  Away   Brooklyn Nets           W    114         106            2019
34  Home   Washington Wizards      W    105         89             2019
35  Away   Atlanta Hawks           W    129         121            2019
36  Home   Detroit Pistons         W    125         88             2019
37  Home   Atlanta Hawks           W    116         108            2019
38  Away   Chicago Bulls           W    119         116            2019
39  Away   Toronto Raptors         L    105         121            2019
40  Away   Cleveland Cavaliers     W    123         115            2019
42  Away   Boston Celtics          L    108         135            2019
43  Away   New York Knicks         W    121         106            2019
44  Home   Phoenix Suns            W    131         97             2019
45  Home   Philadelphia 76ers      L    96          120            2019
46  Home   Dallas Mavericks        W    111         99             2019
47  Home   Charlotte Hornets       W    120         95             2019
48  Home   Toronto Raptors         W    110         106            2019
49  Away   Memphis Grizzlies       L    103         106            2019
50  Home   Golden State Warriors   L    100         132            2019
51  Away   Washington Wizards      L    89          107            2019
52  Away   Orlando Magic           L    100         107            2019
53  Away   Miami Heat              W    95          88             2019
54  Away   New Orleans Pelicans    W    109         107            2019
55  Home   Los Angeles Lakers      W    136         94             2019
56  Home   Los Angeles Clippers    W    116         92             2019
57  Home   Cleveland Cavaliers     W    105         90             2019
58  Home   Charlotte Hornets       W    99          90             2019
59  Home   Milwaukee Bucks         L    97          106            2019
60  Home   New Orleans Pelicans    W    126         111            2019
61  Away   Washington Wizards      W    119         112            2019
63  Away   Detroit Pistons         L    109         113            2019
64  Away   Dallas Mavericks        L    101         110            2019
65  Home   Minnesota Timberwolves  W    122         115            2019
66  Home   Orlando Magic           L    112         117            2019
67  Home   Chicago Bulls           W    105         96             2019
68  Away   Milwaukee Bucks         L    98          117            2019
69  Away   Philadelphia 76ers      L    89          106            2019
70  Home   New York Knicks         W    103         98             2019
71  Home   Oklahoma City Thunder   W    108         106            2019
72  Away   Denver Nuggets          L    100         102            2019
73  Away   Portland Trail Blazers  L    98          106            2019
74  Away   Los Angeles Clippers    L    109         115            2019
75  Away   Golden State Warriors   L    89          112            2019
76  Home   Denver Nuggets          NaN  NaN         NaN            2019
77  Away   Oklahoma City Thunder   NaN  NaN         NaN            2019
78  Away   Boston Celtics          NaN  NaN         NaN            2019
79  Home   Orlando Magic           NaN  NaN         NaN            2019
80  Home   Detroit Pistons         NaN  NaN         NaN            2019
81  Away   Detroit Pistons         NaN  NaN         NaN            2019
82  Home   Boston Celtics          NaN  NaN         NaN            2019
84  Home   Brooklyn Nets           NaN  NaN         NaN            2019
85  Away   Atlanta Hawks           NaN  NaN         NaN            2019
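If you would rather keep the BeautifulSoup loop from the question, the usual pattern is to create empty lists before the loop and append to them for each row. Assigning `x = PacerScore` inside the loop only keeps the value from the last row. Here is a minimal sketch, assuming each row has already been reduced to a list of cell strings. `collect_scores` is a hypothetical helper name, and `rows` reuses three lines from the output above.

```python
def collect_scores(rows, team_col=3, opp_col=4):
    """Collect numeric scores from rows of cell text, skipping rows without a result."""
    x, y = [], []
    for cells in rows:
        try:
            team, opp = int(cells[team_col]), int(cells[opp_col])
        except (ValueError, IndexError):
            continue  # header row, short row, or game not played yet
        x.append(team)
        y.append(opp)
    return x, y


rows = [
    ['Home', 'Memphis Grizzlies', 'W', '111', '83'],
    ['Away', 'Milwaukee Bucks', 'L', '101', '118'],
    ['Home', 'Denver Nuggets', '', '', ''],
]
x, y = collect_scores(rows)
print(x, y)
```

In the original loop, the cell lists would come from `[c.get_text() for c in columns]`, with `team_col` and `opp_col` set to 8 and 9 to match the indices used in the question.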