I am trying to import a CSV file with pandas.read_csv. The file looks like this:
"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
My first attempt was:
data = pd.read_csv('broken.csv')
which gave me:
COL_A COL_B COL_C
ROW1COLA ROW1COLB ROW1COLC ROW1COLD
ROW2COLA ROW2COLB ROW2COLC ROW2COLD
ROW3COLA ROW3COLB ROW3COLC ROW3COLD
ROW4COLA ROW4COLB ROW4COLC ROW4COLD
ROW5COLA ROW5COLB ROW5COLC ROW5COLD
ROW6COLA ROW6COLB ROW6COLC ROW6COLD
ROW7COLA ROW7COLB ROW7COLC ROW7COLD
Setting index_col=False:
data = pd.read_csv('broken.csv',index_col=False)
I get:
COL_A COL_B COL_C
0 ROW1COLA ROW1COLB ROW1COLC
1 ROW2COLA ROW2COLB ROW2COLC
2 ROW3COLA ROW3COLB ROW3COLC
3 ROW4COLA ROW4COLB ROW4COLC
4 ROW5COLA ROW5COLB ROW5COLC
5 ROW6COLA ROW6COLB ROW6COLC
6 ROW7COLA ROW7COLB ROW7COLC
If I add prefix='X':
data = pd.read_csv('broken.csv',index_col=False,prefix='X')
I get:
COL_A COL_B COL_C
0 ROW1COLA ROW1COLB ROW1COLC
1 ROW2COLA ROW2COLB ROW2COLC
2 ROW3COLA ROW3COLB ROW3COLC
3 ROW4COLA ROW4COLB ROW4COLC
4 ROW5COLA ROW5COLB ROW5COLC
5 ROW6COLA ROW6COLB ROW6COLC
6 ROW7COLA ROW7COLB ROW7COLC
The same happens with read_table:
data = pd.read_table('broken.csv',index_col=True,sep=',')
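I would like to know whether pandas has any way to assign headers automatically and still keep the values of the column whose header is missing.
Answer 0 (score: 2)
The first column has no name/header, so it is treated as the index column: when the data rows have one more field than the header row, pandas uses the first column as the index.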
You should also use the index_col parameter correctly:
data = pd.read_table('broken.csv',index_col=[0],sep=',')
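If your first column contains data rather than an index, you can skip the first row, specify the column names yourself, and tell read_csv that there is no header to read: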
cols = ['col1','col2','col3','col4']
data = pd.read_table('broken.csv',sep=',', skiprows=[0], header=None, names=cols)
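To make the difference between the two readings concrete, here is a minimal sketch that uses an in-memory copy of the first two data rows from the question. It assumes the default behavior of read_csv (no index_col argument) for the index case, and the variable names raw, default_df and named are illustrative only.

```python
import io

import pandas as pd

# Two data rows copied from the question: 3 header names, 4 fields per row.
raw = '''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
'''

# Default call: the extra leading field becomes the index, so the
# header names end up labelling the 2nd, 3rd and 4th fields.
default_df = pd.read_csv(io.StringIO(raw))

# Skip the short header row and supply four names instead, so the
# first field stays a regular data column.
cols = ['col1', 'col2', 'col3', 'col4']
named = pd.read_csv(io.StringIO(raw), skiprows=[0], header=None, names=cols)

print(default_df)
print(named)
```

In the first frame COL_A holds the ROW*COLB values, which is the shift seen in the question. The second frame keeps all four fields as data under the supplied names.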
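Answer 1 (score: 2)
I think you can use read_csv with header=0, so that the first row is read as the column names and then overridden by the custom column names passed in names. The sep=',' parameter is omitted because a comma is the default separator: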
import pandas as pd
import io
temp=u'''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), header=0, names=['a','b','c','d'])
print(df)
a b c d
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
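A more general solution is to pass header=None, so that no row is used for column names, together with skiprows=[0] to skip the first row, since that header row is missing the name of the last column. The columns then get default integer labels: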
import pandas as pd
import io
temp=u'''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), header=None, skiprows=[0])
print(df)
0 1 2 3
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
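Neither answer generates the missing header name automatically, which is what the question asks for. One possible sketch is to peek at the header and the first data row with the standard csv module, pad the header with generated names, and then pass the padded list through names. The helper read_csv_padded_header and the 'EXTRA_' prefix are hypothetical and not part of pandas, and the sketch assumes every data row has as many fields as the first one.

```python
import csv
import io

import pandas as pd


def read_csv_padded_header(text, prefix='EXTRA_'):
    """Hypothetical helper: pad a too-short header with generated names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    first_row = next(reader)
    # Assumption: all data rows have the same field count as the first.
    missing = len(first_row) - len(header)
    names = header + [f'{prefix}{i}' for i in range(missing)]
    # header=0 drops the original header row; names replaces it.
    return pd.read_csv(io.StringIO(text), header=0, names=names)


raw = '''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
'''

df = read_csv_padded_header(raw)
print(df)
```

The existing header names are kept and only the unnamed trailing column receives a generated label. For a file on disk, the same idea applies after reading the first two lines from the opened file rather than from a string.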