I am trying to parse some data from this website: http://www.csfbl.com/freeagents.asp?leagueid=2237
I wrote some code:
import urllib.request
import re
name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
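First problem: player_id returns the whole URL, "player.asp?playerid=4209661". I can't get just the numeric part. How can I do that? (My attempt is shown in the commented-out #player_id_num line.)
Second problem: I can't get stat_c when the span class is empty, as in "".
Is there a way to solve these problems? I'm not very familiar with RE (regular expressions). I have looked at tutorials online, but it is still unclear to me what I am doing wrong.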
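For reference, here is a minimal sketch of how the two patterns might be adjusted. It assumes the markup looks like `SAMPLE_HTML` below, which is a hypothetical fragment modeled on the patterns in the question, not a copy of the real page. In the commented-out attempt the opening quote sits after `playerid=` instead of after `href=`, and `.` and `?` are regex metacharacters that need escaping. For the stats, `[^"]?` matches at most one character, so `[^"]*` is used here to accept an empty or arbitrarily long class value.

```python
import re

# Hypothetical fragment; the real page's markup may differ.
SAMPLE_HTML = (
    '<tr><td><a href="player.asp?playerid=4209661" onclick="return false;">'
    'George Pacheco</a></td>'
    '<td class="stat" align="center"><span class="">48</span><br>'
    '<span class="pot">58</span></td></tr>'
)

# Escape "." and "?" and capture only the digits after "playerid=".
PLAYER_ID_RE = re.compile(r'<td><a href="player\.asp\?playerid=(\d+)" onclick=')

# [^"]* accepts class="" as well as class="anything".
STAT_RE = re.compile(
    r'<td class="[^"]+" align="[^"]+">'
    r'<span class="[^"]*">(.+?)</span><br>'
    r'<span class="[^"]*">(.+?)</span></td>'
)


def extract(html):
    """Return (list of player ids, list of (current, potential) pairs)."""
    return PLAYER_ID_RE.findall(html), STAT_RE.findall(html)


ids, stats = extract(SAMPLE_HTML)
print(ids, stats)
```

Capturing both spans in one pattern keeps each current value paired with its potential value, which two separate `findall` calls would not guarantee.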
Answer 0 (score: 1)
This is very simple using the pandas library.
Code:
import pandas as pd
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print(dfs[3])
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print(dfs[3].head())
Result:
0 1 2 3 4 5 6 7 8 9 10 \
0 Pos Name Age T PO FI CO SY HR RA GL
1 P George Pacheco 38 R 4858 7484 8090 7888 6777 4353 6979
2 P David Montoya 34 R 3944 5976 6673 8699 6267 6685 5459
3 P Robert Cole 34 R 5769 7189 7285 5863 6267 5868 5462
4 P Juanold McDonald 32 R 69100 5772 4953 4866 5976 67100 5362
11 12 13 14 15 16
0 AR EN RL Fatigue Salary NaN
1 3747 6171 -3 100% --- $3,672,000
2 5257 5975 -4 96% 2% $2,736,000
3 4953 5061 -4 96% 3% $2,401,000
4 5982 5263 -4 100% --- $1,890,000
From here you can apply whatever cleaning methods you want. The code is rudimentary, so it is up to you to improve it.
More code:
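As one illustration of such cleaning, the sketch below converts the salary and percentage strings seen in the output above into numbers. The helper names `parse_salary` and `parse_percent` are hypothetical, and the sketch assumes every cell is a string in one of the formats shown (`$3,672,000`, `96%`, or the `---` placeholder).

```python
def parse_salary(text):
    """Turn a string such as '$3,672,000' into the integer 3672000."""
    return int(text.replace("$", "").replace(",", ""))


def parse_percent(text):
    """Turn '96%' into 96; return None for placeholders such as '---'."""
    text = text.strip()
    if not text.endswith("%"):
        return None
    return int(text[:-1])


print(parse_salary("$3,672,000"))
print(parse_percent("96%"))
print(parse_percent("---"))
```

Once the table has a proper header, helpers like these could be applied to a column with something like `df["Salary"].apply(parse_salary)`.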
import pandas as pd
import itertools
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.
# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)
# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)
# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]
# http://stackoverflow.com/a/3678930/2548721
# Interleave the two lists: POr, POp, FIr, FIp, ...
comb = list(itertools.chain.from_iterable(zip(orig, clone)))
# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]
# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
df.loc[:, o] = df[s].apply(lambda x: x[:2])
df.loc[:, c] = df[s].apply(lambda x: x[2:])
df = df[new_header] # Drop the other columns.
print(df.head())
More results:
Pos Name Age T POr POp FIr FIp COr COp ... RAp GLr \
0 P George Pacheco 38 R 48 58 74 84 80 90 ... 53 69
1 P David Montoya 34 R 39 44 59 76 66 73 ... 85 54
2 P Robert Cole 34 R 57 69 71 89 72 85 ... 68 54
3 P Juanold McDonald 32 R 69 100 57 72 49 53 ... 100 53
4 P Trevor White 37 R 61 66 62 64 67 67 ... 38 48
GLp ARr ARp ENr ENp RL Fatigue Salary
0 79 37 47 61 71 -3 100% $3,672,000
1 59 52 57 59 75 -4 96% $2,736,000
2 62 49 53 50 61 -4 96% $2,401,000
3 62 59 82 52 63 -4 100% $1,890,000
4 50 70 100 62 69 -4 100% $1,887,000
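Evidently, what I did was separate the Real values from the Potential values. A few tricks were used, but it gets the job done, at least for the first player table. The next few tables need some degree of manipulation.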
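The split itself can be looked at in isolation. The sketch below mirrors the `x[:2]` / `x[2:]` slicing from the code above, so it carries the same assumption: the Real rating is always exactly two digits and everything after it is the Potential rating. The function names are hypothetical.

```python
from itertools import chain


def split_real_potential(fused):
    """Split a fused cell such as '69100' into ('69', '100').

    Assumes the Real rating is exactly two digits; a three-digit
    Real rating would be split incorrectly.
    """
    return fused[:2], fused[2:]


def interleaved_names(stats):
    """Build column names in the order POr, POp, FIr, FIp, ..."""
    return list(chain.from_iterable((s + "r", s + "p") for s in stats))


print(split_real_potential("4858"))
print(split_real_potential("69100"))
print(interleaved_names(["PO", "FI"]))
```

If a table can contain a three-digit Real rating, the fixed two-character slice would need to be replaced with a rule that knows where the first number ends.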