Pandas替换方法和对象数据类型

时间:2016-12-17 12:47:28

标签: python pandas dataframe

我在包含df= df.replace('No data', np.nan)的csv文件上使用‘No data’,而不是没有数字数据的空/空条目。使用head()方法,我看到replace方法确实用‘No data’替换了NaN条目。当我使用df.info()时,它表示每个系列的数据类型都是一个对象。

当我在Excel中打开csv文件并使用find和replace手动编辑数据时,将‘No data’更改为空白/空条目,尽管当我使用df.head()时数据框看起来完全相同,我使用df.info()它表示数据类型是浮点数。

我想知道为什么会这样,我怎样才能使我的系列的数据类型为浮点数,而不必手动编辑csv文件。

3 个答案:

答案 0 :(得分:2)

如果列中的其他数据是数字,那么您应该使用pd.to_numeric(df, errors='coerce')

答案 1 :(得分:0)

import pandas as pd
import numpy as np

# Create data for columns with strings in it
column_data = [1,2,3] + ['no data']
# Costruct data frame with two columns
df = pd.DataFrame({'col1':column_data, 'col2':column_data[::-1]})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1    4 non-null object
col2    4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes

# Replace 'no data' with Nan
df_nan = df.replace('no data', np.nan)
# Set type of all columns to float
df_result = df_nan.as_type({c:float for c in df_nan.columns})
df_result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1    3 non-null float64
col2    3 non-null float64
dtypes: float64(2)
memory usage: 144.0 bytes

答案 2 :(得分:0)

考虑使用pandas.read_csv()中的转换器 arg,您可以在其中传递引用转换函数的列号字典。下面检查无数据字符串并有条件地用np.nan替换值,否则保持原样:

import numpy as np
import pandas as pd

c_fct = lambda x : float(x if 'No data' not in x else np.nan)
convertdict = {1:c_fct, 2:c_fct, 3:c_fct, 4:c_fct, 5:c_fct}

df = pd.read_csv('Input.csv', converters=convertdict)

输入CSV

ID      Col1            Col2            Col3            Col4            Col5
TGG     0.634516647     0.900464347     0.998505978     0.170422713     0.893340128
GRI     No data         0.045915333     0.718398939     0.924813864     No data
NLB     0.921127268     0.614460813     0.677857676     0.343612947     0.559437744
SEI     0.081852313     No data         0.890816385     0.943313021     0.874857844
LOY     0.632556715     0.362855866     0.038702448     0.253762859     No data
OPL     0.375088582     0.268283238     0.761552111     0.589547625     0.192223208
CTK     0.349464541     0.844718987     No data         0.841439909     0.898093646
EUE     0.629784261     0.982589843     0.315670377     0.832419474     0.950044814
JLP     0.543942659     0.988380305     0.417191823     0.823857176     0.542514099
RHK     0.728053447     0.521816539     0.402523435     No data         0.558226706
AEM     0.005495116     0.715363776     0.075508356     0.959119268     0.844730368
VLQ     0.21146319      0.558208766     0.501769554     0.226539046     0.795861461
MDB     0.230514689     0.223163664     No data         0.324636384     0.700716246
LPH     0.853433224     0.582678173     0.633109347     0.432191426     No data
PEP     0.41096305      No data         .627776178      0.482359278     0.179863537
UQK     0.252598809     0.497517585     0.276060768     No data         0.087985623
KGJ     0.033985585     0.033702088     anNo data       0.286682709     0.543349787
JUQ     0.25971543      0.142067155     0.597985191     0.219841249     0.699822866
NYW     No data         0.17187907      0.157413049     0.209011772     0.592824483

<强>输出

print(df)

#      ID      Col1      Col2      Col3      Col4      Col5
# 0   TGG  0.634517  0.900464  0.998506  0.170423  0.893340
# 1   GRI       NaN  0.045915  0.718399  0.924814       NaN
# 2   NLB  0.921127  0.614461  0.677858  0.343613  0.559438
# 3   SEI  0.081852       NaN  0.890816  0.943313  0.874858
# 4   LOY  0.632557  0.362856  0.038702  0.253763       NaN
# 5   OPL  0.375089  0.268283  0.761552  0.589548  0.192223
# 6   CTK  0.349465  0.844719       NaN  0.841440  0.898094
# 7   EUE  0.629784  0.982590  0.315670  0.832419  0.950045
# 8   JLP  0.543943  0.988380  0.417192  0.823857  0.542514
# 9   RHK  0.728053  0.521817  0.402523       NaN  0.558227
# 10  AEM  0.005495  0.715364  0.075508  0.959119  0.844730
# 11  VLQ  0.211463  0.558209  0.501770  0.226539  0.795861
# 12  MDB  0.230515  0.223164       NaN  0.324636  0.700716
# 13  LPH  0.853433  0.582678  0.633109  0.432191       NaN
# 14  PEP  0.410963       NaN  0.627776  0.482359  0.179864
# 15  UQK  0.252599  0.497518  0.276061       NaN  0.087986
# 16  KGJ  0.033986  0.033702       NaN  0.286683  0.543350
# 17  JUQ  0.259715  0.142067  0.597985  0.219841  0.699823
# 18  NYW       NaN  0.171879  0.157413  0.209012  0.592824

print(df.types)

# ID       object
# Col1    float64
# Col2    float64
# Col3    float64
# Col4    float64
# Col5    float64
# dtype: object