How do I scrape a table from a website using Python?

Time: 2020-05-21 23:19:33

Tags: python pandas web-scraping beautifulsoup screen-scraping

I am trying to use Python 3 to scrape the table on this website into a .csv file: 2015 NBA National TV Schedule

The table starts like this:

Date                    Teams                       Network

Oct. 27, 8:00 p.m. ET   Cleveland @ Chicago         TNT
Oct. 27, 10:30 p.m. ET  New Orleans @ Golden State  TNT
Oct. 28, 8:00 p.m. ET   San Antonio @ Oklahoma City ESPN
Oct. 28, 10:30 p.m. ET  Minnesota @ L.A. Lakers     ESPN

I am using these packages:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

The output I want in the .csv file looks like this:

[Image: Scraped Output in .csv File]

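These are the first four rows of the website's table as they should appear in the .csv file. Note that the same date appears on more than one row, and that the time is in its own column. How can I implement a scraper that produces this output?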

2 answers:

Answer 0: (score: 2)

pd.read_html gets you most of the way there:

In [73]: pd.read_html("https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952")[0]
Out[73]:
                          0                            1        2
0                      Date                        Teams  Network
1     Oct. 27, 8:00 p.m. ET          Cleveland @ Chicago      TNT
2    Oct. 27, 10:30 p.m. ET   New Orleans @ Golden State      TNT
3     Oct. 28, 8:00 p.m. ET  San Antonio @ Oklahoma City     ESPN
4    Oct. 28, 10:30 p.m. ET      Minnesota @ L.A. Lakers     ESPN
..                      ...                          ...      ...
139    Apr. 9, 8:30 p.m. ET          Cleveland @ Chicago      ABC
140   Apr. 12, 8:00 p.m. ET  Oklahoma City @ San Antonio      TNT
141  Apr. 12, 10:30 p.m. ET      Memphis @ L.A. Clippers      TNT
142   Apr. 13, 8:00 p.m. ET          Orlando @ Charlotte     ESPN
143  Apr. 13, 10:30 p.m. ET           Utah @ L.A. Lakers     ESPN

All that remains is to parse the date into separate date and time columns and to split the teams into two columns.
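As a rough illustration of that remaining step, here is a minimal sketch that uses only the standard library on the four sample rows from the question. The names `DATE_RE` and `split_row` are hypothetical, and the pattern assumes every date cell looks like "Oct. 27, 8:00 p.m. ET"; rows in a different format would raise an error rather than be guessed at.

```python
import re

# Sample rows copied from the table in the question.
rows = [
    ("Oct. 27, 8:00 p.m. ET", "Cleveland @ Chicago", "TNT"),
    ("Oct. 27, 10:30 p.m. ET", "New Orleans @ Golden State", "TNT"),
    ("Oct. 28, 8:00 p.m. ET", "San Antonio @ Oklahoma City", "ESPN"),
    ("Oct. 28, 10:30 p.m. ET", "Minnesota @ L.A. Lakers", "ESPN"),
]

# Assumed cell format: "<month> <day>, <h:mm> a.m./p.m. ET"
DATE_RE = re.compile(
    r"^(?P<date>[A-Za-z]+\.? \d{1,2}), "
    r"(?P<clock>\d{1,2}:\d{2}) (?P<ampm>[ap])\.m\. ET$"
)


def split_row(date_text, teams, network):
    """Return (date, time, team 1, team 2, network) for one table row."""
    m = DATE_RE.match(date_text.strip())
    if m is None:
        raise ValueError(f"unexpected date format: {date_text!r}")
    # Split on the first '@' only and trim the surrounding spaces.
    team_1, team_2 = (t.strip() for t in teams.split("@", 1))
    time = f"{m['clock']} {m['ampm'].upper()}M"
    return m["date"], time, team_1, team_2, network


parsed = [split_row(*row) for row in rows]
for record in parsed:
    print(record)
```

The same splitting can be done with pandas string methods; the plain-Python version is shown only because it makes each step explicit.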

Answer 1: (score: 1)

You can use pandas' .read_html() to scrape the table, then keep using pandas to manipulate the data:

import pandas as pd
import numpy as np

df = pd.read_html('https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952', header=0)[0]

# Split the Date column at the comma into Date, Time columns
df[['Date','Time']] = df.Date.str.split(',',expand=True)   

# Replace substrings in Time column
df['Time'] = df['Time'].str.replace('p.m. ET', 'PM', regex=False)

# The dates can't be converted to datetime as-is because they have no year. One workaround here:
# append ', 2015' to dates in Oct.-Dec. and ', 2016' to everything else.
# With more than one season, the year would have to be worked out another way.
df['Date'] = np.where(df.Date.str.startswith(('Oct.', 'Nov.', 'Dec.')), df.Date + ', 2015', df.Date + ', 2016')

# Change the date format from abbreviated month to full name (i.e. Oct. -> October).
# '%#d' (day without zero padding) is Windows-specific; on Linux/macOS use '%-d'.
# For a zero-padded day, use plain '%d'.
df['Date'] = pd.to_datetime(df['Date'].astype(str)).dt.strftime('%B %#d, %Y')

# Split the Teams column
df[['Team 1','Team 2']] = df.Teams.str.split('@',expand=True)   

# Remove any leading/trailing whitespace
# (in pandas 2.1+, applymap is deprecated in favor of DataFrame.map)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# Final dataframe with desired columns
df = df[['Date','Time','Team 1','Team 2','Network']]

Output:

                  Date      Time         Team 1         Team 2 Network
0     October 27, 2015   8:00 PM      Cleveland        Chicago     TNT
1     October 27, 2015  10:30 PM    New Orleans   Golden State     TNT
2     October 28, 2015   8:00 PM    San Antonio  Oklahoma City    ESPN
3     October 28, 2015  10:30 PM      Minnesota    L.A. Lakers    ESPN
4     October 29, 2015   8:00 PM        Atlanta       New York     TNT
5     October 29, 2015  10:30 PM         Dallas  L.A. Clippers     TNT
6     October 30, 2015   7:00 PM          Miami      Cleveland    ESPN
7     October 30, 2015   9:30 PM   Golden State        Houston    ESPN
8     November 4, 2015   8:00 PM       New York      Cleveland    ESPN
9     November 4, 2015  10:30 PM  L.A. Clippers   Golden State    ESPN
10    November 5, 2015   8:00 PM  Oklahoma City        Chicago     TNT
11    November 5, 2015  10:30 PM        Memphis       Portland     TNT
12    November 6, 2015   8:00 PM          Miami        Indiana    ESPN
13    November 6, 2015  10:30 PM        Houston     Sacramento    ESPN
14   November 11, 2015   8:00 PM  L.A. Clippers         Dallas    ESPN
15   November 11, 2015  10:30 PM    San Antonio       Portland    ESPN
16   November 12, 2015   8:00 PM   Golden State      Minnesota     TNT
17   November 12, 2015  10:30 PM  L.A. Clippers        Phoenix     TNT
18   November 18, 2015   8:00 PM    New Orleans  Oklahoma City    ESPN
19   November 18, 2015  10:30 PM        Chicago        Phoenix    ESPN
20   November 19, 2015   8:00 PM      Milwaukee      Cleveland     TNT
21   November 19, 2015  10:30 PM   Golden State  L.A. Clippers     TNT
22   November 20, 2015   8:00 PM    San Antonio    New Orleans    ESPN
23   November 20, 2015  10:30 PM        Chicago   Golden State    ESPN
24   November 24, 2015   8:00 PM         Boston        Atlanta     TNT
25   November 24, 2015  10:30 PM    L.A. Lakers   Golden State     TNT
26    December 3, 2015   7:00 PM  Oklahoma City          Miami     TNT
27    December 3, 2015   9:30 PM    San Antonio        Memphis     TNT
28    December 4, 2015   7:00 PM       Brooklyn       New York    ESPN
29    December 4, 2015   9:30 PM      Cleveland    New Orleans    ESPN
..                 ...       ...            ...            ...     ...
113     March 10, 2016  10:30 PM      Cleveland    L.A. Lakers     TNT
114     March 12, 2016   8:30 PM  Oklahoma City    San Antonio     ABC
115     March 13, 2016   3:30 PM      Cleveland  L.A. Clippers     ABC
116     March 14, 2016   8:00 PM        Memphis        Houston    ESPN
117     March 14, 2016  10:30 PM    New Orleans   Golden State    ESPN
118     March 16, 2016   7:00 PM  Oklahoma City         Boston    ESPN
119     March 16, 2016   9:30 PM  L.A. Clippers        Houston    ESPN
120     March 19, 2016   8:30 PM   Golden State    San Antonio     ABC
121     March 22, 2016   8:00 PM        Houston  Oklahoma City     TNT
122     March 22, 2016  10:30 PM        Memphis    L.A. Lakers     TNT
123     March 23, 2016   8:00 PM      Milwaukee      Cleveland    ESPN
124     March 23, 2016  10:30 PM         Dallas       Portland    ESPN
125     March 29, 2016   8:00 PM        Houston      Cleveland     TNT
126     March 29, 2016  10:30 PM     Washington   Golden State     TNT
127     March 31, 2016   7:00 PM        Chicago        Houston     TNT
128     March 31, 2016   9:30 PM  L.A. Clippers  Oklahoma City     TNT
129      April 1, 2016   8:00 PM      Cleveland        Atlanta    ESPN
130      April 1, 2016  10:30 PM         Boston   Golden State    ESPN
131      April 3, 2016   3:30 PM  Oklahoma City        Houston     ABC
132      April 5, 2016   8:00 PM        Chicago        Memphis     TNT
133      April 5, 2016  10:30 PM    L.A. Lakers  L.A. Clippers     TNT
134      April 6, 2016   7:00 PM      Cleveland        Indiana    ESPN
135      April 6, 2016   9:30 PM        Houston         Dallas    ESPN
136      April 7, 2016   8:00 PM        Chicago          Miami     TNT
137      April 7, 2016  10:30 PM    San Antonio   Golden State     TNT
138      April 9, 2016   8:30 PM      Cleveland        Chicago     ABC
139     April 12, 2016   8:00 PM  Oklahoma City    San Antonio     TNT
140     April 12, 2016  10:30 PM        Memphis  L.A. Clippers     TNT
141     April 13, 2016   8:00 PM        Orlando      Charlotte    ESPN
142     April 13, 2016  10:30 PM           Utah    L.A. Lakers    ESPN

[143 rows x 5 columns]
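The question asks for a .csv file, and the code above stops at the final dataframe. A minimal sketch of the last step follows, using a two-row stand-in for `df` built from the output above so that it runs without network access. The filename in the final comment is an assumption.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full `df`.
df = pd.DataFrame(
    {
        "Date": ["October 27, 2015", "October 27, 2015"],
        "Time": ["8:00 PM", "10:30 PM"],
        "Team 1": ["Cleveland", "New Orleans"],
        "Team 2": ["Chicago", "Golden State"],
        "Network": ["TNT", "TNT"],
    }
)

# index=False leaves out the 0..N row labels.
# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead (the filename is an assumption):
# df.to_csv("nba_tv_schedule.csv", index=False)
```

Because the Date values contain a comma, pandas wraps them in double quotes in the CSV, so spreadsheet programs and `pd.read_csv` still read them as a single column.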