Reading data as float with a converter for every column

Time: 2015-01-16 11:00:03

Tags: python csv pandas

I have a csv file called 'filename' and want to read the data in as float64, except for the column 'hour'. I manage it with the pd.read_csv function and converters.

df = pd.read_csv("../data/filename.csv",
                 delimiter = ';',
                 date_parser = ['hour'],
                 skiprows = 1,
                 converters={'column1': lambda x: float(x.replace('.','').replace(',','.'))})

Now, I have two points:

FIRST:

The delimiter ';' works, but if I look at my data in Notepad there are ',' characters, not ';'. But if I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'.
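For reference, a minimal way to print the raw line the parser complains about (the path and the line number are taken from the error above; this is only a diagnostic sketch):

with open("../data/filename.csv") as f:
    for i, line in enumerate(f, start=1):
        if i == 13:  # the line the C parser reports
            print(repr(line))
            break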

SECOND:

If I want to use converters for all columns, how can I do that?! What is the correct syntax? I tried dtype=float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What is going on? That is why I want to manage it with converters.
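For illustration, this is roughly what I mean, as a sketch only (the to_float helper is just a name I made up, mirroring the lambda above, and the column names are read from the file itself):

import pandas as pd

def to_float(x):
    # e.g. '12.985,0' -> 12985.0
    return float(x.replace('.', '').replace(',', '.'))

# read only the first row to get the column names
header = pd.read_csv("../data/filename.csv", delimiter=';', skiprows=1, nrows=1)
converters = {col: to_float for col in header.columns if col != 'hour'}

df = pd.read_csv("../data/filename.csv",
                 delimiter=';',
                 skiprows=1,
                 converters=converters)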

Data:

,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"

1 Answer:

Answer 0 (score: 1)

This should work:

In [40]:
# imports assumed for the rest of this session
import io
import pandas as pd
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
   Unnamed: 0  hour  PV  Wind onshore  Wind offshore  PV.1  Wind onshore.1  \
0           0     1   0       12985.0         9614.0     0         32825.5   
1           1     2   0       12908.9         9290.8     0         36052.3   
2           2     3   0       12740.9         8886.9     0         38540.9   
3           3     4   0       12485.3         8644.5     0         40734.0   
4           4     5   0       11188.5         8079.0     0         42688.0   
5           5     6   0       11219.0         7594.2     0         43333.5   

   Wind offshore.1  PV.2  Wind onshore.2  Wind offshore.2  
0           9495.7     0         13110.3          10855.5  
1           9589.1     0         13670.2          10828.6  
2          10087.3     0         14610.8          10828.6  
3          10087.3     0         15638.3          10343.7  
4          10087.3     0         16809.4          10343.7  
5          10025.0     0         18266.9          10343.7  
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0         6 non-null int64
hour               6 non-null int64
PV                 6 non-null float64
Wind onshore       6 non-null float64
Wind offshore      6 non-null float64
PV.1               6 non-null float64
Wind onshore.1     6 non-null float64
Wind offshore.1    6 non-null float64
PV.2               6 non-null float64
Wind onshore.2     6 non-null float64
Wind offshore.2    6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes

So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

EDIT

If you want to do the conversion after importing (which is a waste when you could do it upfront), then you can do this for each column of interest:

In [43]:
import numpy as np
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')

It would be quicker to replace the comma separator first on all the columns of interest and then just call convert_objects like so: df.convert_objects(convert_numeric=True)
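A minimal sketch of that bulk route, assuming df was read without thousands=',' so the remaining string columns still carry embedded commas:

# strip the thousands separator from the string (object-dtype) columns only
str_cols = df.select_dtypes(include=['object']).columns
for col in str_cols:
    df[col] = df[col].str.replace(',', '')
# let pandas infer the numeric dtypes in one pass (pandas 0.15-era API)
df = df.convert_objects(convert_numeric=True)
df.dtypes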