Question

我尝试使用从bballreference.com导入的CSV。但正如您所看到的，分隔的值都在一行中，而不是按列分隔。在NumPy Pandas上，解决这个问题的最简单方法是什么？我用google搜索无济于事。

我不知道如何以干净的方式发布CSV文件，但现在是：

",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"

Answer 1

似乎CSV文件的前两行是标题，但pd.read_csv的默认行为认为只有第一行是标题。

此外，开头和尾随引号使pd.read_csv认为其间的文字是单个字段/列。

您可以尝试以下方法：

删除开头和尾随引号，

bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')

以下是如何使用Python删除开头和尾随引号：

# open 'quotes.csv' in read mode with variable in_file as handle
# open 'no_quotes.csv' in write mode with variable out_file as handle
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line
    # the variable line stores each line as string
    for line in in_file:
        # line[1:-1] slices the string to omit the first and last character
        # append a newline character '\n' to the sliced line
        # write the string with newline to out_file
        out_file.write(line[1:-1] + '\n')

# read_csv on 'no_quotes.csv'
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()

Answer 2

考虑将csv作为文本文件读取，在文本文件读取时删除每行的开头/结尾引号，告诉解析器它们之间的所有数据都是一个奇异值。并使用内置StringIO将文本字符串读入数据框而不是保存到磁盘以进行导入。

此外，跳过重复的总计和 Per Game 的第一行，甚至是聚合的最后一行，因为你可以用pandas做到这一点。

from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')

print(df)

<强>输出

      Rk                      Player   Age  G   GS    MP   FG  FGA  3P  3PA  ...    PTS    FG%    3P%    FT%   MP.1  PTS.1  TRB.1  AST.1  STL.1  BLK.1
0    1.0      Kevin Durant\duranke01  29.0  5  5.0   182   54  107   9   28  ...    139  0.505  0.321  0.815   36.5   27.8    7.4    4.8    1.4    1.2
1    2.0     Klay Thompson\thompkl01  27.0  5  5.0   183   38   99  12   43  ...     99  0.384  0.279  1.000   36.7   19.8    6.4    1.8    0.2    0.4
2    3.0     Stephen Curry\curryst01  29.0  4  3.0   125   32   67  15   34  ...     98  0.478  0.441  1.000   31.2   24.5    5.3    3.5    2.0    0.5
3    4.0    Draymond Green\greendr01  27.0  5  5.0   186   27   55   8   20  ...     74  0.491  0.400  0.800   37.1   14.8   11.8   10.0    2.4    1.6
4    5.0    Andre Iguodala\iguodan01  34.0  5  4.0   140   14   29   4   12  ...     39  0.483  0.333  0.583   27.9    7.8    5.0    3.4    2.0    0.4
5    6.0         Quinn Cook\cookqu01  24.0  4  0.0    58   12   27   0   10  ...     30  0.444  0.000  0.750   14.4    7.5    2.3    1.0    0.3    0.0
6    7.0      Kevon Looney\looneke01  21.0  5  0.0   113   12   17   0    0  ...     28  0.706    NaN  0.500   22.6    5.6    5.8    1.0    0.8    0.2
7    8.0  Shaun Livingston\livinsh01  32.0  5  0.0    79   11   27   0    0  ...     26  0.407    NaN  1.000   15.9    5.2    1.2    2.4    0.0    0.2
8    9.0         David West\westda01  37.0  5  0.0    40    8   14   0    0  ...     16  0.571    NaN    NaN    7.9    3.2    1.4    2.6    0.4    0.8
9   10.0        Nick Young\youngni01  32.0  4  2.0    41    3   11   3   10  ...     11  0.273  0.300  0.667   10.2    2.8    1.0    0.3    0.3    0.0
10  11.0      JaVale McGee\mcgeeja01  30.0  3  1.0    19    3    8   0    1  ...      6  0.375  0.000    NaN    6.2    2.0    2.0    0.0    0.0    0.3
11  12.0     Zaza Pachulia\pachuza01  33.0  2  0.0     8    1    2   0    0  ...      4  0.500    NaN  0.500    4.2    2.0    3.0    0.0    1.0    0.0
12  13.0           Jordan Belelljo01  23.0  4  0.0    23    1    4   0    0  ...      3  0.250    NaN  0.500    5.8    0.8    1.5    1.3    0.5    0.5
13  14.0      Damian Jones\jonesda03  22.0  1  0.0     3    0    1   0    0  ...      2  0.000    NaN  1.000    3.2    2.0    0.0    0.0    0.0    0.0

[14 rows x 30 columns]

Numpy将CSV分隔为列

2 个答案: