Parsing data from a website with regular expressions in Python

Date: 2015-06-03 00:11:45

Tags: python

I am trying to parse some data from this website: http://www.csfbl.com/freeagents.asp?leagueid=2237

I wrote some code:

import urllib.request
import re

name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'

sock = urllib.request.urlopen(url).read().decode("utf-8")

#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)

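First problem: player_id returns the whole URL, "player.asp?playerid=4209661". I can't get just the numeric part. How can I do that? (My attempt is the commented-out player_id_num pattern.)

Second problem: I can't get stat_c when the span's class attribute is empty (class="").

Is there a way to solve these problems? I'm not very familiar with regular expressions. I have looked at tutorials online, but it's still not clear to me what I'm doing wrong.

1 answer:

Answer 0 (score: 1)

This is very simple with the pandas library.

Code: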

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)

# print(dfs[3])
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print(dfs[3].head())

Result:

    0                 1    2  3      4     5     6     7     8      9     10  \
0  Pos              Name  Age  T     PO    FI    CO    SY    HR     RA    GL   
1    P    George Pacheco   38  R   4858  7484  8090  7888  6777   4353  6979   
2    P     David Montoya   34  R   3944  5976  6673  8699  6267   6685  5459   
3    P       Robert Cole   34  R   5769  7189  7285  5863  6267   5868  5462   
4    P  Juanold McDonald   32  R  69100  5772  4953  4866  5976  67100  5362   

     11    12  13       14      15          16  
0    AR    EN  RL  Fatigue  Salary         NaN  
1  3747  6171  -3     100%     ---  $3,672,000  
2  5257  5975  -4      96%      2%  $2,736,000  
3  4953  5061  -4      96%      3%  $2,401,000  
4  5982  5263  -4     100%     ---  $1,890,000 

From here you can apply whatever cleaning you want. The code is crude, so it's up to you to improve it.

More code:
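As one example of such cleaning, here is a minimal sketch that turns the Salary and Fatigue strings into numbers. It runs on a small hand-built frame whose values are copied from the output above; the column names are assumed to be already in place, which is not the case for the raw dfs[3].

```python
import pandas as pd

# Sample rows copied from the printed output; the column
# names are an assumption (the raw table has numeric labels).
df = pd.DataFrame({
    "Name": ["George Pacheco", "David Montoya"],
    "Fatigue": ["100%", "96%"],
    "Salary": ["$3,672,000", "$2,736,000"],
})

# Remove "$" and "," and convert the salary to integers.
df["Salary"] = df["Salary"].str.replace(r"[$,]", "", regex=True).astype(int)

# Remove the trailing "%" and convert fatigue to integers.
df["Fatigue"] = df["Fatigue"].str.rstrip("%").astype(int)

print(df)
```

Once the columns are numeric, sorting and filtering (for example by salary) work as expected.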

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
# Copy the slice so new columns can be added safely.
df = df.iloc[1:-1,:].copy()
df.columns = header
df.reset_index(drop=True, inplace=True)

# A two-row header would turn the dataframe
# into a nightmare. Let's try an aesthetic
# change instead: suffix each stat column.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# Interleave the two lists: POr, POp, FIr, FIp, ...
# http://stackoverflow.com/a/3678930/2548721
comb = [name for pair in zip(orig, clone) for name in pair]

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow but does it cleanly: the first two characters
# are the Real value, the rest is the Potential value.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header] # Drop the other columns.

print(df.head())

More results:

  Pos              Name Age  T POr  POp FIr FIp COr COp     ...      RAp GLr  \
0   P    George Pacheco  38  R  48   58  74  84  80  90     ...       53  69   
1   P     David Montoya  34  R  39   44  59  76  66  73     ...       85  54   
2   P       Robert Cole  34  R  57   69  71  89  72  85     ...       68  54   
3   P  Juanold McDonald  32  R  69  100  57  72  49  53     ...      100  53   
4   P      Trevor White  37  R  61   66  62  64  67  67     ...       38  48   

  GLp ARr  ARp ENr ENp  RL Fatigue      Salary  
0  79  37   47  61  71  -3    100%  $3,672,000  
1  59  52   57  59  75  -4     96%  $2,736,000  
2  62  49   53  50  61  -4     96%  $2,401,000  
3  62  59   82  52  63  -4    100%  $1,890,000  
4  50  70  100  62  69  -4    100%  $1,887,000  

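Obviously, what I did was separate the Real values from the Potential values. A few tricks were used, but it gets the job done, at least for the first player table. The next few tables will need some amount of manipulation.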
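
If you would still rather stay with regular expressions, the two problems in the question can probably be handled as in the sketch below. The HTML string is a made-up fragment modelled on the patterns in the question (the onclick value and the class names are assumptions), so the real page may need adjustments. The two ideas: escape the literal "?" and capture only the digits after "playerid=", and use [^"]* so that an empty class attribute also matches.

```python
import re

# Hypothetical fragment modelled on the question's patterns;
# the real page markup may differ.
html = (
    '<td><a href="player.asp?playerid=4209661" onclick="return x();">'
    'George Pacheco</a></td>'
    '<td class="stat" align="center"><span class="">48</span><br>'
    '<span class="hi">58</span></td>'
)

# "." and "?" are escaped so they match literally;
# (\d+) captures only the numeric id.
player_id_num = re.compile(r'<a href="player\.asp\?playerid=(\d+)"')

# [^"]* accepts an empty class as well as a value of any length.
stat = re.compile(
    r'<span class="[^"]*">([^<]*)</span><br>'
    r'<span class="[^"]*">([^<]*)</span>'
)

ids = player_id_num.findall(html)
stats = stat.findall(html)

print(ids)    # ['4209661']
print(stats)  # [('48', '58')]
```

Capturing both spans in one pattern returns each pair as a tuple, which keeps the two values of a cell together.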