Converting string objects to int/float using pandas

Time: 2014-06-04 12:31:34

Tags: python csv pandas

import pandas as pd

path1 = "/home/supertramp/Desktop/100&life_180_data.csv"

mydf = pd.read_csv(path1)

numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}

print(mydf['Cigarettes'])

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)

print(mydf['CigarNum'])

mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')

The CSV file "100&life_180_data.csv" contains columns such as Age, BMI, Cigarettes, and Alcohol.

No                int64
Age               int64
BMI             float64
Alcohol          object
Cigarettes       object
dtype: object

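The Cigarettes column contains "Never", "1-5 Cigarettes/day", and "10-20 Cigarettes/day". I want to assign weights to these objects (Never, 1-5 Cigarettes/day, ...).

The expected output is a new appended column, CigarNum, that contains only the numbers 0, 1, 2. Instead, CigarNum is as expected up to row 8 and then shows NaN through to the last row of the column.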

0                     Never
1                     Never
2        1-5 Cigarettes/day
3                     Never
4                     Never
5                     Never
6                     Never
7                     Never
8                     Never
9                     Never
10                    Never
11                    Never
12     10-20 Cigarettes/day
13       1-5 Cigarettes/day
14                    Never
...
167                    Never
168                    Never
169     10-20 Cigarettes/day
170                    Never
171                    Never
172                    Never
173                    Never
174                    Never
175                    Never
176                    Never
177                    Never
178                    Never
179                    Never
180                    Never
181                    Never
Name: Cigarettes, Length: 182, dtype: object

The output I get should not show NaN after the first few rows:

0      0
1      0
2      1
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10   NaN
11   NaN
12   NaN
13   NaN
14     0
...
167   NaN
168   NaN
169   NaN
170   NaN
171   NaN
172   NaN
173   NaN
174   NaN
175   NaN
176   NaN
177   NaN
178   NaN
179   NaN
180   NaN
181   NaN
Name: CigarNum, Length: 182, dtype: float64

2 answers:

Answer 0: (score: 33)

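OK, first, the problem is that you have stray spaces embedded in the values, which causes the function to be applied incorrectly.

Fix this using the vectorised str methods: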
# strip() trims only leading/trailing whitespace, so the spaces inside the numcigar keys still match
mydf['Cigarettes'] = mydf['Cigarettes'].str.strip()

Creating the new column should now work:

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
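To see the effect on a small scale, here is a minimal sketch using made-up sample rows (not the asker's file). It assumes the stray spaces are leading or trailing, so it trims them with `str.strip()` and leaves the spaces inside the dictionary keys alone. `Series.map` is used for the lookup; with a dict it behaves like `apply(numcigar.get)` here, yielding NaN for keys that are not found.

```python
import pandas as pd

# Hypothetical sample data: two values carry a trailing space.
df = pd.DataFrame({"Cigarettes": ["Never", "Never ",
                                  "1-5 Cigarettes/day ",
                                  "10-20 Cigarettes/day"]})
numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}

# Listing the distinct raw values exposes the whitespace variants:
# "Never" and "Never " show up as separate entries.
raw_values = df["Cigarettes"].unique().tolist()

# Without cleaning, values with stray whitespace are not dict keys -> NaN.
before = df["Cigarettes"].map(numcigar)

# Trim leading/trailing whitespace only, then look the values up again.
df["Cigarettes"] = df["Cigarettes"].str.strip()
df["CigarNum"] = df["Cigarettes"].map(numcigar).astype(float)

print(raw_values)
print(before)
print(df["CigarNum"])
```

Printing the list of unique values rather than the Series makes the trailing spaces visible, because each string is shown with its quotes.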

Update

Thanks to @Jeff, as always, for pointing out a superior way of doing this:

So you can call replace instead of calling apply:

mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
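One difference between the two approaches is worth keeping in mind: a value that is missing from the dictionary is left untouched by `replace`, whereas a dict lookup through `map` turns it into NaN. The sketch below illustrates this with an invented value, "Sometimes", which is not in the asker's data.

```python
import pandas as pd

# "Sometimes" is a made-up value that has no entry in the dictionary.
s = pd.Series(["Never", "1-5 Cigarettes/day", "Sometimes"], dtype=object)
numcigar = {"Never": 0.0, "1-5 Cigarettes/day": 1.0, "10-20 Cigarettes/day": 4.0}

# replace: mapped values are swapped in, the unmapped string stays as it is.
replaced = s.replace(numcigar)

# map: every value is looked up, and an unmapped value becomes NaN.
mapped = s.map(numcigar)

print(replaced.tolist())
print(mapped.tolist())
```

This is why a numeric conversion step is still needed after `replace`: any leftover string keeps the column from being purely numeric.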

You could also use the factorize method.

Thinking about it, why not just set the dict values to floats and avoid the type conversion altogether?

So:
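As a rough illustration of that option, `pd.factorize` assigns an integer code to each distinct value in order of first appearance. The sample values below are invented. Note that the codes are consecutive integers, so they will not reproduce custom weights such as the 4 used in `numcigar`.

```python
import pandas as pd

# Hypothetical sample values, not taken from the asker's file.
s = pd.Series(["Never", "1-5 Cigarettes/day", "Never", "10-20 Cigarettes/day"])

# codes: one integer per row; uniques: the distinct values in first-seen order.
codes, uniques = pd.factorize(s)

print(codes)          # integer codes, one per row
print(list(uniques))  # the label each code stands for
```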

numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}

Version 0.17.0 or newer

convert_objects has been deprecated since 0.17.0 and has been replaced by to_numeric:
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')

Here errors='coerce' returns NaN where a value cannot be converted to a number; without it, an exception is raised.

Answer 1: (score: 4)
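A small sketch of both behaviours, using an invented column in which "n/a" stands in for any unparseable entry:

```python
import pandas as pd

# Hypothetical data: three numeric strings and one value that is not a number.
s = pd.Series(["0", "1", "4", "n/a"], dtype=object)

# errors='coerce' converts what it can and puts NaN where parsing fails.
coerced = pd.to_numeric(s, errors="coerce")

# The default (errors='raise') raises a ValueError on the same input.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

print(coerced)
print(raised)
```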

Try this function for all problems of this kind:

import numpy as np
import pandas as pd

def get_series_ids(x):
    '''Return a pandas Series of integer ids corresponding to the
       objects in the input pandas Series x.
       Example:
       get_series_ids(pd.Series(['a','a','b','b','c']))
       returns Series([0,0,1,1,2], dtype=int)'''

    # sorted unique values of x
    values = np.unique(x)
    # map each unique value to its position in the sorted order
    values2nums = dict(zip(values, range(len(values))))
    return x.replace(values2nums)
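For comparison, pandas can produce the same kind of sorted-order ids without a helper function. This is an alternative sketch, not part of the answer above: it uses categorical codes and `pd.factorize(..., sort=True)` on invented input.

```python
import pandas as pd

# Hypothetical input, deliberately not in sorted order.
x = pd.Series(["b", "a", "c", "a"])

# Categories are the sorted unique values, so each code is the value's
# position in that sorted order.
ids_from_category = x.astype("category").cat.codes

# factorize with sort=True numbers the values in sorted order as well.
ids_from_factorize, labels = pd.factorize(x, sort=True)

print(ids_from_category.tolist())
print(ids_from_factorize.tolist())
```

As with the function above, the ids only reflect sort order, so an explicit dictionary is still the way to go when specific weights are required.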