pandas does not infer str type for values in double quotes

Asked: 2015-12-20 07:20:57

Tags: python csv pandas dataframe

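I want to import data from a csv file with pandas.read_csv(). One of my columns holds values wrapped in double quotes (these strings are numbers that represent categories). I found that pandas does not infer these strings as 'object' type; it infers them as int64. See the following example: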

a.csv

uid, f_1, f_2
1,   "1", 1.1
2,   "2", 2.3
3,   "0", 4.8

pandas.read_csv('a.csv').dtypes gives the following output:

uid: int64
f_1: int64
f_2: float64

The type of f_1 is 'int64' rather than 'object'.

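However, if I replace every " in a.csv with ', then f_1 is correctly read as 'object'. How can I prevent the wrong inference without modifying 'a.csv'? A second question: why does pandas infer strings as 'object' type rather than 'str' type?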

2 Answers:

Answer 0: (score: 1)

I think you need to add the skipinitialspace parameter to read_csv:

skipinitialspace : boolean, default False. Skip spaces after the delimiter.

Test:

import pandas as pd
import numpy as np
import io


temp=u"""uid, f_1, f_2
1,  "1", 1.19
2,  "2", 2.3
3,  "0", 4.8"""

print(pd.read_csv(io.StringIO(temp)))
   uid    f_1   f_2
0    1    "1"  1.19
1    2    "2"  2.30
2    3    "0"  4.80

# dtype does not help here: the header is read as ' f_1' (leading space),
# so the key 'f_1' matches no column
print(pd.read_csv(io.StringIO(temp), dtype={'f_1': np.int64}).dtypes)
uid       int64
 f_1     object
 f_2    float64
dtype: object

print(pd.read_csv(io.StringIO(temp), skipinitialspace=True).dtypes)
uid      int64
f_1      int64
f_2    float64
dtype: object

If you want to remove the leading and trailing " characters from column f_1, use converters:

import pandas as pd
import io


temp=u"""uid, f_1, f_2
1,  "1", 1.19
2,  "2", 2.3
3,  "0", 4.8"""

print(pd.read_csv(io.StringIO(temp)))
   uid    f_1   f_2
0    1    "1"  1.19
1    2    "2"  2.30
2    3    "0"  4.80

# strip the surrounding double quotes
def converter(x):
    return x.strip('"')

# map each column name to its converter
converters = {'f_1': converter}

df = pd.read_csv(io.StringIO(temp), skipinitialspace=True, converters=converters)
print(df)
   uid f_1   f_2
0    1   1  1.19
1    2   2  2.30
2    3   0  4.80
print(df.dtypes)
uid      int64
f_1     object
f_2    float64
dtype: object

If you need to convert the integer column f_1 to string, use dtype:

import pandas as pd
import io


temp=u"""uid, f_1, f_2
1,  1, 1.19
2,  2, 2.3
3,  0, 4.8"""

print(pd.read_csv(io.StringIO(temp)).dtypes)
uid       int64
 f_1      int64
 f_2    float64
dtype: object

df = pd.read_csv(io.StringIO(temp), skipinitialspace=True, dtype={'f_1': str})

print(df)
   uid f_1   f_2
0    1   1  1.19
1    2   2  2.30
2    3   0  4.80
print(df.dtypes)
uid      int64
f_1     object
f_2    float64
dtype: object

Note: don't forget to change io.StringIO(temp) to 'a.csv'.

For an explanation of str vs object, see here.
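
As a rough illustration of the str vs object point, here is a small sketch. It assumes a pandas version where text columns default to object dtype (newer releases may report a dedicated string dtype instead), and the inline csv_text and variable names are made up for this example. The column-level dtype describes the container, while each element is still an ordinary Python str:

```python
import io
import pandas as pd

# Hypothetical sample data, not the asker's file
csv_text = 'uid,f_1\n1,"1"\n2,"2"\n'

# Ask pandas to keep f_1 as text instead of inferring integers
df = pd.read_csv(io.StringIO(csv_text), dtype={"f_1": str})

# Column dtype: the container type chosen by pandas for text
print(df["f_1"].dtype)

# Element type: each value is a plain Python str
first = df["f_1"].iloc[0]
print(type(first))
```

NumPy string arrays are fixed-width, so pandas has traditionally stored variable-length Python strings in an object array, which is why the dtype reads object even though every value is a str.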

Answer 1: (score: 0)

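You can override the type inference of the read_csv call by passing a type, or a dict mapping column names to types, in the optional dtype parameter; see the pandas documentation for read_csv.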
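
A minimal sketch of this approach on the sample data from the question. It assumes the dtype dict is combined with skipinitialspace=True, so that the header is read as 'f_1' rather than ' f_1' and the dict key matches the column. The inline csv_text stands in for a.csv, and the variable names are illustrative:

```python
import io
import pandas as pd

# Same layout as a.csv in the question: spaces after commas, quoted category codes
csv_text = '''uid, f_1, f_2
1,   "1", 1.1
2,   "2", 2.3
3,   "0", 4.8'''

df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,   # drop spaces after the delimiter so quotes and headers parse cleanly
    dtype={"f_1": str},      # keep the category codes as strings
)

labels = df["f_1"].tolist()
print(labels)
print(df.dtypes)
```

To read the real file, replace io.StringIO(csv_text) with 'a.csv'.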