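I applied normalization to multiple columns of a Pandas DataFrame using a for loop, under the following conditions:
Normalization of columns A and B to the range: [-1,+1]
Normalization of column C to the range: [-40,+150]
The results replace the values in a new DataFrame, which we call norm_data,
and that DataFrame is stored as a csv file.
My data is in the txt file dataset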
# Import and call the needed libraries
import numpy as np
import pandas as pd
#Normalizing Formula
def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value
#Split data in three different lists A, B and C
df1 = pd.read_csv(r'D:\me4.TXT', header=None)
id_set = df1[df1.index % 4 == 0].astype('int').values
A = df1[df1.index % 4 == 1].values
B = df1[df1.index % 4 == 2].values
C = df1[df1.index % 4 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]} # arrays
#df contains all the data
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0])
df2 = pd.DataFrame(data, index= id_set[0:])
print(df)
#--------------------------------
cycles = int(len(df)/480)
print(cycles)
#next iteration create all plots, change the number of cycles
for i in df:
    min_val = df[i].min()
    max_val = df[i].max()
    if i=='C':
        #Applying normalization for C between [-40,+150]
        data['C'] = normalize(df[i].values, min_val, max_val, -40, 150)
    elif i=='A':
        #Applying normalization for A , B between [-1,+1]
        data['A'] = normalize(df[i].values, min_val, max_val, -1, 1)
    else:
        data['B'] = normalize(df[i].values, min_val, max_val, -1, 1)
norm_data = pd.DataFrame(data)
print(norm_data)
norm_data.to_csv('norm.csv')
df2.to_csv('my_file.csv')
print(df2)
The problem is that after normalizing with @Lucas's help, I lost my index, which is labeled id_set.
So far I get the output below in my_file.csv, along with this error:
TypeError: unsupported format string passed to numpy.ndarray.__format__
id_set A B C
['0'] 2.291171 -2.689658 -344.047912
['10'] 2.176816 -4.381186 -335.936524
['20'] 2.291171 -2.589725 -342.544885
['30'] 2.176597 -6.360999 0.000000
['40'] 2.577268 -1.993412 -344.326376
['50'] 9.844076 -2.690917 -346.125859
['60'] 2.061782 -2.889378 -346.378859
['70'] 2.348300 -2.789547 -347.980986
['80'] 6.973350 -1.893454 -337.884738
['90'] 2.520040 -3.087004 -349.209006
Those [''] are unwanted!
After normalization, my desired output should look like this:
id_set A B C
000 -0.716746 0.158663 112.403310
010 -0.726023 0.037448 113.289702
020 -0.716746 0.165824 112.567557
030 -0.726040 -0.104426 150.000000
040 -0.693538 0.208556 112.372881
050 -0.104061 0.158573 112.176238
060 -0.735354 0.144351 112.148590
070 -0.712112 0.151505 111.973514
080 -0.336932 0.215719 113.076807
090 -0.698181 0.130189 111.839319
010 0.068357 -0.019388 114.346421
011 0.022007 0.165824 112.381444
Any ideas would be welcome, as this is important to me.
Answer 0 (score: 0)
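If I understand you correctly, my_file.csv / df2 should look like the output in the lower half of your question? In that case, I believe you only have a typo in the construction of df2. You want its index to look the same as the one in df, so: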
df2 = pd.DataFrame(data, index = id_set[:,0])
instead of
df2 = pd.DataFrame(data, index= id_set[0:])
(note what is inside the square brackets).
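To see why the brackets matter, here is a minimal sketch with made-up ids (the small id_set array and the column values are illustrative assumptions, not taken from your file). id_set[0:] keeps the array two-dimensional, so each index label is itself a one-element array, while id_set[:,0] selects the first column and yields plain scalars:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the id_set built in the question (shape: n rows x 1 column)
id_set = np.array([[0], [10], [20]])

print(id_set[0:].shape)    # still 2-D: every "label" is a 1-element array
print(id_set[:, 0].shape)  # 1-D: every label is a plain integer

# Made-up column values, only here to build a small frame
data = {'A': [2.29, 2.18, 2.29], 'B': [-2.69, -4.38, -2.59]}

# Using the 1-D slice gives clean labels 0, 10, 20
df2 = pd.DataFrame(data, index=id_set[:, 0])
print(df2)
```

Depending on the pandas version, I believe passing the 2-D slice as the index either gives labels like the ['0'] in your output or is rejected outright, which is why the sketch only builds the 1-D variant.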
This makes your output file my_file.csv look like this:
,A,B,C
0,2.19117130798,-2.5897247305,-342.54488522400004
10,2.19117130798,-4.3811855641,-335.936524309
20,2.19117130798,-2.5897247305,-342.54488522400004
...
The output file norm.csv looks like this:
,A,B,C
0,-1.0,0.16582420581574775,145.05394742081884
1,-1.0,0.037447604422215175,145.9298596578588
2,-1.0,0.16582420581574775,145.05394742081884
...
If you want the output file norm.csv to have the same index (0,10,20 instead of 0,1,2 ...), you need to define norm_data as
norm_data = pd.DataFrame(data, index = id_set[:,0])
instead of
norm_data = pd.DataFrame(data)
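As a rough sketch of how the normalization and the index fix could fit together (the ranges dictionary and the toy values are my own illustration, not your data): normalizing each column as a pandas Series instead of going through .values keeps the id_set index attached, and index_label names the index column in the csv.

```python
import pandas as pd

def normalize(value, min_value, max_value, min_norm, max_norm):
    # Same min-max formula as in the question
    return (max_norm - min_norm) * ((value - min_value) / (max_value - min_value)) + min_norm

# Toy stand-in for df, indexed by id_set (values are made up)
df = pd.DataFrame(
    {"A": [2.0, 4.0, 6.0], "B": [-1.0, 0.0, 3.0], "C": [10.0, 20.0, 15.0]},
    index=[0, 10, 20],
)

# Hypothetical helper mapping: target range for each column
ranges = {"A": (-1, 1), "B": (-1, 1), "C": (-40, 150)}

# Each normalized column is a Series that still carries the original index
norm_data = pd.DataFrame(
    {col: normalize(df[col], df[col].min(), df[col].max(), lo, hi)
     for col, (lo, hi) in ranges.items()}
)

# With no path given, to_csv returns the csv text; index_label names the index column
csv_text = norm_data.to_csv(index_label="id_set")
print(csv_text)
```

To write the file itself, pass a path as the first argument, for example norm_data.to_csv('norm.csv', index_label='id_set').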
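Also, I should note that your data contains several NaN/inf entries, which mess up your normalization.
You can replace those with
df = df.replace(np.inf, np.nan)
df = df.fillna(0)
(credit to this question/answer), and do the same for df2. You can also use the same functions to replace the NaN/inf entries with other values.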
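Here is a small sketch of that cleanup on made-up values (the raw frame below is an assumption for illustration). It also covers -inf, which replacing np.inf alone would leave in place:

```python
import numpy as np
import pandas as pd

# Made-up frame containing NaN, +inf and -inf entries
raw = pd.DataFrame(
    {"A": [1.0, np.inf, 3.0], "C": [np.nan, -np.inf, 5.0]},
    index=[0, 10, 20],
)

# An inf entry becomes the column max, so (max - min) in the formula is no longer finite
print(raw["A"].max())

# Turn both infinities into NaN, then fill every NaN with 0
clean = raw.replace([np.inf, -np.inf], np.nan).fillna(0)
print(clean)
```

Keep in mind that the fill value takes part in the min/max. In your data the other C values are around -340, so a row filled with 0 becomes the column maximum, which appears to be why id 30 ends up at exactly 150 in your desired output.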