I am trying to scrape the table from this website into a .csv file using Python 3: 2015 NBA National TV Schedule
The table starts with:
Date Teams Network
Oct. 27, 8:00 p.m. ET Cleveland @ Chicago TNT
Oct. 27, 10:30 p.m. ET New Orleans @ Golden State TNT
Oct. 28, 8:00 p.m. ET San Antonio @ Oklahoma City ESPN
Oct. 28, 10:30 p.m. ET Minnesota @ L.A. Lakers ESPN
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
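The output I want in the .csv file looks like this:
These are the first four rows of the website's table as they should appear in the .csv file. Note that some dates are used more than once, and that the time is in a separate column. How can I implement a scraper to get this output?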
Answer 0 (score: 2)
pd.read_html can get you most of the way there:
In [73]: pd.read_html("https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952")[0]
Out[73]:
0 1 2
0 Date Teams Network
1 Oct. 27, 8:00 p.m. ET Cleveland @ Chicago TNT
2 Oct. 27, 10:30 p.m. ET New Orleans @ Golden State TNT
3 Oct. 28, 8:00 p.m. ET San Antonio @ Oklahoma City ESPN
4 Oct. 28, 10:30 p.m. ET Minnesota @ L.A. Lakers ESPN
.. ... ... ...
139 Apr. 9, 8:30 p.m. ET Cleveland @ Chicago ABC
140 Apr. 12, 8:00 p.m. ET Oklahoma City @ San Antonio TNT
141 Apr. 12, 10:30 p.m. ET Memphis @ L.A. Clippers TNT
142 Apr. 13, 8:00 p.m. ET Orlando @ Charlotte ESPN
143 Apr. 13, 10:30 p.m. ET Utah @ L.A. Lakers ESPN
You then only need to split the date and time into separate columns and split the teams apart.
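That remaining step could look roughly like the sketch below. It runs on a few rows copied from the table above rather than on the live page, and the output column names (`Away`, `Home`, which assume that "@" means "away at home") are choices made for this sketch, not part of the answer.

```python
import pandas as pd

# Sample rows copied from the table above; on the real page these would come
# from pd.read_html(...)[0] with the first row used as the header.
raw = pd.DataFrame(
    {
        "Date": [
            "Oct. 27, 8:00 p.m. ET",
            "Oct. 27, 10:30 p.m. ET",
            "Oct. 28, 8:00 p.m. ET",
        ],
        "Teams": [
            "Cleveland @ Chicago",
            "New Orleans @ Golden State",
            "San Antonio @ Oklahoma City",
        ],
        "Network": ["TNT", "TNT", "ESPN"],
    }
)

# Split "Oct. 27, 8:00 p.m. ET" into a date part and a time part at the first comma.
date_time = raw["Date"].str.split(",", n=1, expand=True)
# Split "Cleveland @ Chicago" into two team names at the "@".
teams = raw["Teams"].str.split("@", n=1, expand=True)

# Assemble the result, stripping the whitespace left over from the splits.
parsed = pd.DataFrame(
    {
        "Date": date_time[0].str.strip(),
        "Time": date_time[1].str.strip(),
        "Away": teams[0].str.strip(),  # assumed meaning of the left-hand team
        "Home": teams[1].str.strip(),  # assumed meaning of the right-hand team
        "Network": raw["Network"],
    }
)
print(parsed)
```

Limiting each split with `n=1` keeps any later comma or "@" inside the second piece instead of producing extra columns.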
Answer 1 (score: 1)
You can use pandas' .read_html() to scrape the table, then continue using pandas to manipulate the data:
import pandas as pd
import numpy as np
df = pd.read_html('https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952', header=0)[0]
# Split the Date column at the comma into Date, Time columns
df[['Date','Time']] = df.Date.str.split(',',expand=True)
# Replace the literal 'p.m. ET' suffix in the Time column (regex=False so '.' is not a wildcard)
df['Time'] = df['Time'].str.replace('p.m. ET', 'PM', regex=False)
# Can't convert to datetime yet because there is no year. One way to handle it here:
# for dates in Oct.-Dec. append ', 2015', otherwise append ', 2016'
# With more than one season, this would have to be worked out another way
df['Date'] = np.where(df.Date.str.startswith(('Oct.', 'Nov.', 'Dec.')), df.Date + ', 2015', df.Date + ', 2016')
# Change the date format from abbreviated month to full name (i.e. Oct. -> October)
# '%#d' (day without zero padding) is Windows-specific; on Linux/macOS use '%-d'
# For a zero-padded day, use plain '%d'
df['Date'] = pd.to_datetime(df['Date'].astype(str)).dt.strftime('%B %#d, %Y')
# Split the Teams column
df[['Team 1','Team 2']] = df.Teams.str.split('@',expand=True)
# Remove any leading/trailing whitespace
# (pandas 2.1+ deprecates DataFrame.applymap in favor of DataFrame.map)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Final dataframe with desired columns
df = df[['Date','Time','Team 1','Team 2','Network']]
Output:
Date Time Team 1 Team 2 Network
0 October 27, 2015 8:00 PM Cleveland Chicago TNT
1 October 27, 2015 10:30 PM New Orleans Golden State TNT
2 October 28, 2015 8:00 PM San Antonio Oklahoma City ESPN
3 October 28, 2015 10:30 PM Minnesota L.A. Lakers ESPN
4 October 29, 2015 8:00 PM Atlanta New York TNT
5 October 29, 2015 10:30 PM Dallas L.A. Clippers TNT
6 October 30, 2015 7:00 PM Miami Cleveland ESPN
7 October 30, 2015 9:30 PM Golden State Houston ESPN
8 November 4, 2015 8:00 PM New York Cleveland ESPN
9 November 4, 2015 10:30 PM L.A. Clippers Golden State ESPN
10 November 5, 2015 8:00 PM Oklahoma City Chicago TNT
11 November 5, 2015 10:30 PM Memphis Portland TNT
12 November 6, 2015 8:00 PM Miami Indiana ESPN
13 November 6, 2015 10:30 PM Houston Sacramento ESPN
14 November 11, 2015 8:00 PM L.A. Clippers Dallas ESPN
15 November 11, 2015 10:30 PM San Antonio Portland ESPN
16 November 12, 2015 8:00 PM Golden State Minnesota TNT
17 November 12, 2015 10:30 PM L.A. Clippers Phoenix TNT
18 November 18, 2015 8:00 PM New Orleans Oklahoma City ESPN
19 November 18, 2015 10:30 PM Chicago Phoenix ESPN
20 November 19, 2015 8:00 PM Milwaukee Cleveland TNT
21 November 19, 2015 10:30 PM Golden State L.A. Clippers TNT
22 November 20, 2015 8:00 PM San Antonio New Orleans ESPN
23 November 20, 2015 10:30 PM Chicago Golden State ESPN
24 November 24, 2015 8:00 PM Boston Atlanta TNT
25 November 24, 2015 10:30 PM L.A. Lakers Golden State TNT
26 December 3, 2015 7:00 PM Oklahoma City Miami TNT
27 December 3, 2015 9:30 PM San Antonio Memphis TNT
28 December 4, 2015 7:00 PM Brooklyn New York ESPN
29 December 4, 2015 9:30 PM Cleveland New Orleans ESPN
.. ... ... ... ... ...
113 March 10, 2016 10:30 PM Cleveland L.A. Lakers TNT
114 March 12, 2016 8:30 PM Oklahoma City San Antonio ABC
115 March 13, 2016 3:30 PM Cleveland L.A. Clippers ABC
116 March 14, 2016 8:00 PM Memphis Houston ESPN
117 March 14, 2016 10:30 PM New Orleans Golden State ESPN
118 March 16, 2016 7:00 PM Oklahoma City Boston ESPN
119 March 16, 2016 9:30 PM L.A. Clippers Houston ESPN
120 March 19, 2016 8:30 PM Golden State San Antonio ABC
121 March 22, 2016 8:00 PM Houston Oklahoma City TNT
122 March 22, 2016 10:30 PM Memphis L.A. Lakers TNT
123 March 23, 2016 8:00 PM Milwaukee Cleveland ESPN
124 March 23, 2016 10:30 PM Dallas Portland ESPN
125 March 29, 2016 8:00 PM Houston Cleveland TNT
126 March 29, 2016 10:30 PM Washington Golden State TNT
127 March 31, 2016 7:00 PM Chicago Houston TNT
128 March 31, 2016 9:30 PM L.A. Clippers Oklahoma City TNT
129 April 1, 2016 8:00 PM Cleveland Atlanta ESPN
130 April 1, 2016 10:30 PM Boston Golden State ESPN
131 April 3, 2016 3:30 PM Oklahoma City Houston ABC
132 April 5, 2016 8:00 PM Chicago Memphis TNT
133 April 5, 2016 10:30 PM L.A. Lakers L.A. Clippers TNT
134 April 6, 2016 7:00 PM Cleveland Indiana ESPN
135 April 6, 2016 9:30 PM Houston Dallas ESPN
136 April 7, 2016 8:00 PM Chicago Miami TNT
137 April 7, 2016 10:30 PM San Antonio Golden State TNT
138 April 9, 2016 8:30 PM Cleveland Chicago ABC
139 April 12, 2016 8:00 PM Oklahoma City San Antonio TNT
140 April 12, 2016 10:30 PM Memphis L.A. Clippers TNT
141 April 13, 2016 8:00 PM Orlando Charlotte ESPN
142 April 13, 2016 10:30 PM Utah L.A. Lakers ESPN
[143 rows x 5 columns]
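The question asks for a .csv file, which this output does not yet produce. A minimal sketch of that last step follows, using two rows copied from the output above in place of the full DataFrame. The file name `nba_tv_schedule.csv` mentioned in the comment is a hypothetical choice.

```python
import io

import pandas as pd

# Two rows copied from the output above, standing in for the full DataFrame.
df = pd.DataFrame(
    {
        "Date": ["October 27, 2015", "October 27, 2015"],
        "Time": ["8:00 PM", "10:30 PM"],
        "Team 1": ["Cleveland", "New Orleans"],
        "Team 2": ["Chicago", "Golden State"],
        "Network": ["TNT", "TNT"],
    }
)

# index=False leaves the 0, 1, 2, ... row labels out of the file.
# An in-memory buffer is used here so the sketch has no side effects;
# passing a path such as "nba_tv_schedule.csv" (hypothetical name) instead
# would write the file to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the formatted dates contain a comma, `to_csv` wraps them in double quotes, so the file still has five columns when read back.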