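I scraped the 2018 MLB pitching stats. I want to turn the various categories into a DataFrame so that I can export it to Excel, and I would like to use pandas. This is my code so far: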
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml"
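# Download the page and decode the bytes (str() on bytes would produce a "b'...'" literal)
content = urlopen(url).read().decode("utf-8")
# Remove HTML comment markup so that tables hidden inside comments get parsed
comment = content.replace("-->", "").replace("<!--", "")
tree = fromstring(comment)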
for pitcher_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    age = pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0]
    w = pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0]
    l = pitcher_row.xpath('.//td[@data-stat="L"]/text()')[0]
    g = pitcher_row.xpath('.//td[@data-stat="G"]/text()')[0]
    gs = pitcher_row.xpath('.//td[@data-stat="GS"]/text()')[0]
    ip = pitcher_row.xpath('.//td[@data-stat="IP"]/text()')[0]
    hits = pitcher_row.xpath('.//td[@data-stat="H"]/text()')[0]
    runs = pitcher_row.xpath('.//td[@data-stat="R"]/text()')[0]
    bb = pitcher_row.xpath('.//td[@data-stat="BB"]/text()')[0]
    so = pitcher_row.xpath('.//td[@data-stat="SO"]/text()')[0]
    # print data
    print(names, age, w, l, g, gs, ip, hits, runs, bb, so)
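I want to build a DataFrame from the scraped data. Does anyone know how to do that?
I have seen the instructions for creating a DataFrame at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html, but I don't know how to apply them to my case.
Here is the example from that page: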
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
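However, I want to use my own data, and I'm not sure whether I need to append it.
Thanks!
Answer 0 (score: 2)
How about instantiating an empty DataFrame and appending your scraped data row by row: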
import pandas as pd

columns = ("names", "age", "w", "l", "g", "gs", "ip", "hits", "runs", "bb", "so")
df = pd.DataFrame(columns=columns)
for idx, pitcher_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]')):
    # Collect the values for one pitcher, in the same order as `columns`
    tmp = []
    tmp.append(pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text)
    tmp.append(pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0])
    tmp.append(pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0])
    ...
    # Write the collected values as row number idx
    df.loc[idx] = tmp
Or, even simpler, if you want to keep most of your code:
columns = ("names", "age", "w", "l", "g", "gs", "ip", "hits", "runs", "bb", "so")
df = pd.DataFrame(columns=columns)
for idx, pitcher_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    age = pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0]
    w = pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0]
    l = pitcher_row.xpath('.//td[@data-stat="L"]/text()')[0]
    g = pitcher_row.xpath('.//td[@data-stat="G"]/text()')[0]
    gs = pitcher_row.xpath('.//td[@data-stat="GS"]/text()')[0]
    ip = pitcher_row.xpath('.//td[@data-stat="IP"]/text()')[0]
    hits = pitcher_row.xpath('.//td[@data-stat="H"]/text()')[0]
    runs = pitcher_row.xpath('.//td[@data-stat="R"]/text()')[0]
    bb = pitcher_row.xpath('.//td[@data-stat="BB"]/text()')[0]
    so = pitcher_row.xpath('.//td[@data-stat="SO"]/text()')[0]
    df.loc[idx] = (names, age, w, l, g, gs, ip, hits, runs, bb, so)
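A possible variation, not part of the answer above: assigning to df.loc[idx] grows the DataFrame one row at a time, which can get slow for large tables. A common alternative is to collect one dict per row in a plain list and build the DataFrame once at the end. The sketch below assumes the same XPath expressions as the question; the names SAMPLE_HTML, STAT_COLUMNS and rows_to_frame are made up for illustration, and the two sample rows are invented placeholder data standing in for the downloaded page. Only a few of the stat columns are shown.

```python
import pandas as pd
from lxml.html import fromstring

# Invented stand-in for the downloaded page (assumption: same markup shape
# as the real table, but with placeholder pitchers and only four columns).
SAMPLE_HTML = """
<html><body>
<table class="stats_table">
  <tr class="full_table">
    <td data-stat="player"><a href="#">Pitcher A</a></td>
    <td data-stat="age">27</td>
    <td data-stat="W">10</td>
    <td data-stat="L">5</td>
  </tr>
  <tr class="full_table">
    <td data-stat="player"><a href="#">Pitcher B</a></td>
    <td data-stat="age">31</td>
    <td data-stat="W">3</td>
    <td data-stat="L">8</td>
  </tr>
</table>
</body></html>
"""

# Hypothetical mapping: DataFrame column name -> data-stat attribute value.
STAT_COLUMNS = {"age": "age", "w": "W", "l": "L"}

ROW_XPATH = '//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'


def rows_to_frame(tree):
    """Collect one dict per table row, then build the DataFrame in one go."""
    records = []
    for row in tree.xpath(ROW_XPATH):
        record = {"names": row.xpath('.//td[@data-stat="player"]/a')[0].text}
        for column, stat in STAT_COLUMNS.items():
            cells = row.xpath('.//td[@data-stat="%s"]/text()' % stat)
            # An empty cell yields no text node; store None instead of failing
            record[column] = cells[0] if cells else None
        records.append(record)

    df = pd.DataFrame(records, columns=["names", *STAT_COLUMNS])
    # The scraped values are strings; convert the stat columns to numbers
    numeric = list(STAT_COLUMNS)
    df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")
    return df


df = rows_to_frame(fromstring(SAMPLE_HTML))
print(df)
```

With the real page, passing the tree built in the question to rows_to_frame should work the same way once STAT_COLUMNS lists all the stats you need. The result can then be written with df.to_excel("pitchers.xlsx", index=False), which requires an Excel engine such as openpyxl to be installed; the file name here is just an example.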