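I'm trying to use a CSV imported from bballreference.com. As you can see, the delimited values all end up in a single column instead of being split into separate columns. What is the simplest way to fix this with NumPy/pandas? I Googled to no avail.
I don't know how to post the CSV file cleanly, but here it is: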
",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"
Answer 0 (score: 2)
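It looks like the first two lines of the CSV file are headers, but by default pd.read_csv treats only the first line as the header.
In addition, the opening and trailing quotes on each line make pd.read_csv treat everything between them as a single field/column.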
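To see the quoting effect in isolation, here is a minimal sketch using a shortened, made-up two-line sample in the same style as the file (the sample string and variable names are only for illustration). It compares the shape read_csv returns with and without the wrapping quotes.

```python
from io import StringIO
import pandas as pd

# Shortened stand-in for the real file: every line is wrapped in one pair of quotes.
quoted = '"Rk,Player,Age"\n"1,Kevin Durant,29"\n'

# With the quotes in place, each line is parsed as one quoted field.
df_quoted = pd.read_csv(StringIO(quoted))
print(df_quoted.shape)

# With the quotes removed, the commas act as delimiters again.
unquoted = quoted.replace('"', '')
df_unquoted = pd.read_csv(StringIO(unquoted))
print(df_unquoted.shape)
print(list(df_unquoted.columns))
```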
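You can try the following:
Remove the opening and trailing quotes from each line, then read the file with both header rows:
bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')
Here is how to remove the opening and trailing quotes with Python: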
import pandas as pd

# open 'quotes.csv' for reading and 'no_quotes.csv' for writing
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line; each line is a string that still ends in '\n'
    for line in in_file:
        # drop the trailing newline first, then slice off the first and last
        # characters (the opening and closing quotes) and write the result
        # to out_file with a newline appended
        out_file.write(line.rstrip('\n')[1:-1] + '\n')

# read_csv on 'no_quotes.csv'
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()
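With header=[0, 1] the columns come back as a two-level MultiIndex, and the blank cells in the top header row get placeholder names that start with "Unnamed". If a flat header is easier to work with, something like the sketch below might help. It runs on a shortened in-memory sample rather than the real file, and the flattened naming scheme ("Totals PTS" and so on) is just my own assumption about what would be convenient.

```python
from io import StringIO
import pandas as pd

# Shortened sample with the same two-row header layout as the real file.
sample = (
    ",,Totals,Totals,Per Game\n"
    "Rk,Player,G,PTS,PTS\n"
    "1,Kevin Durant,5,139,27.8\n"
    "2,Klay Thompson,5,99,19.8\n"
)

bbal = pd.read_csv(StringIO(sample), header=[0, 1])

# Keep only the lower label where the upper one is a placeholder,
# otherwise join both levels so duplicated names such as PTS stay distinct.
flat_columns = [
    low if str(top).startswith('Unnamed') else f'{top} {low}'
    for top, low in bbal.columns
]
bbal.columns = flat_columns
print(bbal)
```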
Answer 1 (score: 0)
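Consider reading the CSV as a text file and, while reading it, stripping the opening and closing quotes on each line, since those quotes tell the parser that everything between them is a single value. Then use the built-in StringIO to read the text string into a DataFrame instead of saving it to disk for import.
Also, skip the first row with the repeated Totals and Per Game labels, and even the aggregated last row, since you can compute that with pandas.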
from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')
print(df)
Output
Rk Player Age G GS MP FG FGA 3P 3PA ... PTS FG% 3P% FT% MP.1 PTS.1 TRB.1 AST.1 STL.1 BLK.1
0 1.0 Kevin Durant\duranke01 29.0 5 5.0 182 54 107 9 28 ... 139 0.505 0.321 0.815 36.5 27.8 7.4 4.8 1.4 1.2
1 2.0 Klay Thompson\thompkl01 27.0 5 5.0 183 38 99 12 43 ... 99 0.384 0.279 1.000 36.7 19.8 6.4 1.8 0.2 0.4
2 3.0 Stephen Curry\curryst01 29.0 4 3.0 125 32 67 15 34 ... 98 0.478 0.441 1.000 31.2 24.5 5.3 3.5 2.0 0.5
3 4.0 Draymond Green\greendr01 27.0 5 5.0 186 27 55 8 20 ... 74 0.491 0.400 0.800 37.1 14.8 11.8 10.0 2.4 1.6
4 5.0 Andre Iguodala\iguodan01 34.0 5 4.0 140 14 29 4 12 ... 39 0.483 0.333 0.583 27.9 7.8 5.0 3.4 2.0 0.4
5 6.0 Quinn Cook\cookqu01 24.0 4 0.0 58 12 27 0 10 ... 30 0.444 0.000 0.750 14.4 7.5 2.3 1.0 0.3 0.0
6 7.0 Kevon Looney\looneke01 21.0 5 0.0 113 12 17 0 0 ... 28 0.706 NaN 0.500 22.6 5.6 5.8 1.0 0.8 0.2
7 8.0 Shaun Livingston\livinsh01 32.0 5 0.0 79 11 27 0 0 ... 26 0.407 NaN 1.000 15.9 5.2 1.2 2.4 0.0 0.2
8 9.0 David West\westda01 37.0 5 0.0 40 8 14 0 0 ... 16 0.571 NaN NaN 7.9 3.2 1.4 2.6 0.4 0.8
9 10.0 Nick Young\youngni01 32.0 4 2.0 41 3 11 3 10 ... 11 0.273 0.300 0.667 10.2 2.8 1.0 0.3 0.3 0.0
10 11.0 JaVale McGee\mcgeeja01 30.0 3 1.0 19 3 8 0 1 ... 6 0.375 0.000 NaN 6.2 2.0 2.0 0.0 0.0 0.3
11 12.0 Zaza Pachulia\pachuza01 33.0 2 0.0 8 1 2 0 0 ... 4 0.500 NaN 0.500 4.2 2.0 3.0 0.0 1.0 0.0
12 13.0 Jordan Bell\belljo01 23.0 4 0.0 23 1 4 0 0 ... 3 0.250 NaN 0.500 5.8 0.8 1.5 1.3 0.5 0.5
13 14.0 Damian Jones\jonesda03 22.0 1 0.0 3 0 1 0 0 ... 2 0.000 NaN 1.000 3.2 2.0 0.0 0.0 0.0 0.0
[14 rows x 30 columns]
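As a rough illustration of recomputing the dropped aggregate row with pandas, the sketch below rebuilds just the Player and PTS columns from the data in the question (typed in by hand here, not read from the file) and sums the points. It also splits the Player field on the backslash; the Name and PlayerID column names are my own.

```python
import pandas as pd

# Player and total PTS values copied from the question's data.
df = pd.DataFrame({
    'Player': [
        r'Kevin Durant\duranke01', r'Klay Thompson\thompkl01',
        r'Stephen Curry\curryst01', r'Draymond Green\greendr01',
        r'Andre Iguodala\iguodan01', r'Quinn Cook\cookqu01',
        r'Kevon Looney\looneke01', r'Shaun Livingston\livinsh01',
        r'David West\westda01', r'Nick Young\youngni01',
        r'JaVale McGee\mcgeeja01', r'Zaza Pachulia\pachuza01',
        r'Jordan Bell\belljo01', r'Damian Jones\jonesda03',
    ],
    'PTS': [139, 99, 98, 74, 39, 30, 28, 26, 16, 11, 6, 4, 3, 2],
})

# Split "Name\id" on the single backslash into two new columns.
parts = df['Player'].str.split('\\', expand=True)
df['Name'] = parts[0]
df['PlayerID'] = parts[1]

# Recompute the team total that the skipped footer row carried.
total_pts = df['PTS'].sum()
print(total_pts)
```

The sum should agree with the PTS value in the Team Totals line of the original file.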