I am trying to read this dataset from Kaggle:
https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv
I am using pandas' read_csv function, but the columns are not being split correctly. This is the code I tried:
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")
Printing the resulting DataFrame gives me:
tourney_id ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0 NaN Roger Federer ... NaN
299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 106233.0 8.0 NaN Dominic Thiem ... NaN
298.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0 NaN Bernard Tomic ... NaN
297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 105777.0 NaN NaN Grigor Dimitrov ... NaN
296.0 106233.0 8.0 NaN Dominic Thiem R NaN AUT 22.335387 20.0 1600.0 105227.0 3.0 NaN Marin Cilic ... NaN
Why are the columns not being split correctly?
I expected output like the following, which is what I get for every year except 2016 and 2017:
tourney_id tourney_name surface ... l_SvGms l_bpSaved l_bpFaced
0 2015-329 Tokyo Hard ... 10.0 2.0 5.0
1 2015-329 Tokyo Hard ... 13.0 12.0 19.0
2 2015-329 Tokyo Hard ... 18.0 9.0 11.0
3 2015-329 Tokyo Hard ... 13.0 4.0 8.0
4 2015-329 Tokyo Hard ... 10.0 1.0 5.0
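The actual csv file looks fine and has the same format as the other years. I also tried specifying the columns with the columns argument of read_csv, but that gives the same output.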
Answer 0 (score: 2)
The safest approach I can think of is to read the csv twice:
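# First pass: skip the header line and read the data rows without column names
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header=None)
# Drop columns that contain only NaNs
rows = rows.dropna(axis=1, how='all')
# Second pass: read only the header line and reuse its column names
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns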
print(rows.head(5))
Output:
tourney_id tourney_name surface draw_size tourney_level tourney_date \
0 2016-M020 Brisbane Hard 32.0 A 20160104.0
1 2016-M020 Brisbane Hard 32.0 A 20160104.0
2 2016-M020 Brisbane Hard 32.0 A 20160104.0
3 2016-M020 Brisbane Hard 32.0 A 20160104.0
4 2016-M020 Brisbane Hard 32.0 A 20160104.0
match_num winner_id winner_seed winner_entry ... w_bpFaced l_ace l_df \
0 300.0 105683.0 4.0 NaN ... 1.0 7.0 3.0
1 299.0 103819.0 1.0 NaN ... 1.0 2.0 4.0
2 298.0 105683.0 4.0 NaN ... 4.0 10.0 3.0
3 297.0 103819.0 1.0 NaN ... 1.0 8.0 2.0
4 296.0 106233.0 8.0 NaN ... 2.0 11.0 2.0
l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 61.0 34.0 25.0 14.0 10.0 3.0 5.0
1 55.0 31.0 18.0 9.0 8.0 2.0 6.0
2 84.0 54.0 41.0 16.0 12.0 2.0 2.0
3 104.0 62.0 46.0 21.0 16.0 8.0 11.0
4 98.0 52.0 41.0 27.0 15.0 7.0 8.0
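One possible cause, offered here as an assumption because the raw file is not shown in this thread: if every data row in the 2016 file ends with extra delimiters, each row has more fields than the header, and read_csv then uses the leading column(s) as the index. The sketch below reproduces that effect with a small made-up CSV (not the Kaggle data) and shows index_col=False as a possible one-pass alternative to reading the file twice.

```python
import io

import pandas as pd

# Made-up sample: every data row ends with a trailing comma,
# so each row has one more field than the header.
sample = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# Default behavior: the first column becomes the index and the
# remaining values shift one column to the left.
df_default = pd.read_csv(io.StringIO(sample))

# index_col=False tells pandas not to use the first column as the index.
df_fixed = pd.read_csv(io.StringIO(sample), index_col=False)

print(df_default)
print(df_fixed)
```

Whether this applies to atp_matches_2016.csv depends on the file really having trailing delimiters, which would need to be checked in a text editor.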
Answer 1 (score: 1)
Try this:
import csv
with open('Path/to/file/atp_matches_2016.csv') as csvfile:
    reader = csv.DictReader(csvfile)
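csv.DictReader creates an object that operates like a regular reader but maps the information in each row to a dict whose keys are given by the optional fieldnames parameter. The fieldnames parameter is a sequence whose elements are associated with the fields of the input data in order. These elements become the keys of the resulting dictionary. If fieldnames is omitted, the values in the first row of the file are used as the field names. If a row has more fields than fieldnames, the remaining data is added as a list stored under the key given by restkey. If a row has fewer fields than fieldnames, the missing keys take the value of the optional restval parameter.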
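The restkey and restval behavior described above can be sketched as follows. The sample data and the key name "extra" are made up for illustration and are not taken from the Kaggle file.

```python
import csv
import io

# Made-up sample: the second data row has one field more than the header,
# and the third data row has one field fewer.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"

# Surplus fields are collected in a list under restkey;
# missing fields are filled with restval.
reader = csv.DictReader(io.StringIO(sample), restkey="extra", restval="missing")
rows = list(reader)

for row in rows:
    print(row)
```

Rows with surplus fields are easy to spot this way, since only they carry the "extra" key.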
Answer 2 (score: 1)
If the .csv file has empty columns, you can pass keep_default_na=False so that read_csv interprets the empty columns as empty strings.
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv", keep_default_na=False)
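As a rough illustration of what keep_default_na changes, the sketch below reads a small made-up CSV (not the Kaggle data) with and without the flag. Note that the flag controls how empty values are represented; whether it also resolves the column-splitting problem in the question is not confirmed here.

```python
import io

import pandas as pd

# Made-up sample: the "entry" field is empty in the first data row.
sample = "player,entry\nRaonic,\nThiem,WC\n"

# Default: empty fields are parsed as NaN.
df_nan = pd.read_csv(io.StringIO(sample))

# keep_default_na=False: empty fields are kept as empty strings.
df_str = pd.read_csv(io.StringIO(sample), keep_default_na=False)

print(df_nan["entry"].tolist())
print(df_str["entry"].tolist())
```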
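Answer 3 (score: 0)
One approach is to first read a row that you know contains the correct columns, then use those columns when reading the full csv file. Like this: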
cols = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', nrows=1).columns
df = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', usecols=cols)
Output:
df.head()
tourney_id tourney_name surface draw_size tourney_level ... l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 2016-M020 Brisbane Hard 32.0 A ... 25.0 14.0 10.0 3.0 5.0
1 2016-M020 Brisbane Hard 32.0 A ... 18.0 9.0 8.0 2.0 6.0
2 2016-M020 Brisbane Hard 32.0 A ... 41.0 16.0 12.0 2.0 2.0
3 2016-M020 Brisbane Hard 32.0 A ... 46.0 21.0 16.0 8.0 11.0
4 2016-M020 Brisbane Hard 32.0 A ... 41.0 27.0 15.0 7.0 8.0
[5 rows x 49 columns]
Answer 4 (score: 0)