From a project, I have a set of dictionaries that look like this:
METTS_MARK = {'salary':365788,'to_messages':807,'deferral_payments':'NaN','total_payments':1061827,'exercised_stock_options':'NaN','bonus':600000,'restricted_stock':585062,'shared_receipt_with_poi':702,'restricted_stock_deferred':'NaN','total_stock_value':585062,'expenses':94299,'loan_advances':'NaN','from_messages':29,'other':1740,'from_this_person_to_poi':1,'poi':False,'director_fees':'NaN','deferred_income':'NaN','long_term_incentive':'NaN','email_address':'mark.metts@enron.com','from_poi_to_this_person':38}
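What I want to do is take each value, scale it, replace the "NaN" values with 0, and then put it back in the correct place in the dictionary.
The code I have tried is below: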
import pickle

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
# The key named TOTAL in the dataset creates an obvious outlier, so I removed it
del data_dict["TOTAL"]
# Features I selected by intuition
my_features = [
    'poi',
    'salary',#
    'bonus',#
    'exercised_stock_options',#
    'total_stock_value',#
    'total_payments',
    'expenses',
    'loan_advances',#
    'deferral_payments',
    'deferred_income',
    'restricted_stock',#
    'restricted_stock_deferred',
    'long_term_incentive',#
    'shared_receipt_with_poi',#
    #'from_this_person_to_poi',
    #'director_fees',
    #'from_messages',
    #'to_messages',
    #'from_poi_to_this_person'
]
keys = data_dict.keys()
values = data_dict.values()
# Replace NaN values with 0
list_of_values = []
for key in keys:
    tmp_list = []
    for feature in my_features:
        try:
            data_dict[key][feature]
        except KeyError:
            print("error: key " + feature + " not present")
        value = data_dict[key][feature]
        if value == "NaN":
            value = 0
        tmp_list.append(float(value))
    list_of_values.append(tmp_list)
# Feature scaling with a min/max scaler
import numpy as np
from sklearn.preprocessing import MinMaxScaler
data_array = np.array(list_of_values)
scaler = MinMaxScaler()
rescaled_data = scaler.fit_transform(data_array)
So now I have a list of lists that looks like this:
[0. 0.32916568 0.075 0. 0.01279963 0.01025327 0.41221264 0. 0.01569801 1. 0.18366453 0.10365427 0. 0.12715088]
I want to put these rescaled values, together with their corresponding features, back into a dictionary. This is the code I wrote:
my_data_dict = []
for key in keys:
    key = {}
    for x in range( len(rescaled_data) ):
        for count in range( len(my_features) ):
            key[ my_features[count] ] = rescaled_data[x][count]
    my_data_dict.append(key)
But I get a long list of dictionaries that all have the same values. For example:
{'salary':0.24744478779905296,'deferral_payments':0.01569801010492397,'total_payments':0.01228550157492107,'loan_advances':0.0,'bonus':0.075,'restricted_stock_deferred':0.1036542684938879,'total_stock_value':0.550692201098954,'exercised_stock_options':0.011200759837784508,'poi':1.0,'deferred_income':1.0,'shared_receipt_with_poi':0.1583046549538127,'restricted_stock':0.17265209213492153,'long_term_incentive':0.013803111652000
{'salary':0.24744478779905296,'deferral_payments':0.01569801010492397,'total_payments':0.01228550157492107,'loan_advances':0.0,'bonus':0.075,'restricted_stock_deferred':0.1036542684938879,'total_stock_value':0.550692201098954,'exercised_stock_options':0.011200759837784508,'poi':1.0,'deferred_income':1.0,'shared_receipt_with_poi':0.1583046549538127,'restricted_stock':0.17265209213492153,'long_term_incentive':0.013803111652000
How do I take the keys from data_dict (the old dictionary), rescale their data, and put them into a new dictionary?
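For reference, the loop above appears to produce identical dictionaries because the inner loop over x rewrites every feature once per row, so each dictionary ends up holding only the last row. A minimal sketch of pairing each key with its own row follows. The names and numbers are hypothetical stand-ins for keys, my_features and rescaled_data, and it assumes the rows are in the same order as the keys, which holds when list_of_values was built by iterating over keys.

```python
# Hypothetical stand-ins for the variables in the question (made-up values).
keys = ["METTS MARK", "PERSON B"]
my_features = ["poi", "salary", "bonus"]
rescaled_data = [
    [0.0, 0.33, 0.075],
    [0.0, 0.24, 0.15],
]

# Pair each key with its own row, then each feature name with its value.
# The result is a dictionary of dictionaries, keyed like data_dict.
my_data_dict = {
    key: dict(zip(my_features, row))
    for key, row in zip(keys, rescaled_data)
}

print(my_data_dict["METTS MARK"])
```

Using the person's name as the outer key, rather than appending to a list, keeps the result in the same shape as the original data_dict.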
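Answer 0 (score: 0)
As Joe Patten said, pandas makes this easier: you can convert the dictionary to a DataFrame, process it, and then convert it back to a dictionary if you need to.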
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
ser = pd.Series(METTS_MARK) #I am using your METTS_MARK
ser.replace('NaN',0,inplace=True)
ser.drop(index="email_address",inplace=True) #to make everything numerical so we can scale, you can add it back later
df = pd.DataFrame(ser)
scaler = MinMaxScaler()
df[0] = scaler.fit_transform(df)
When you are done:
newDict = df[0].to_dict()
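Note that the snippet above works on a single person's record, so the scaler compares that person's features with each other. If the goal is to scale each feature across all people and keep every person's key, the same idea might be extended roughly as follows. This is only a sketch: the miniature data_dict, its names and its numbers are made up for illustration, and my_features is cut down to two columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature stand-in for data_dict (made-up numbers).
data_dict = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "email_address": "a@example.com"},
    "PERSON B": {"salary": 300, "bonus": 50, "email_address": "b@example.com"},
    "PERSON C": {"salary": 200, "bonus": 100, "email_address": "c@example.com"},
}
my_features = ["salary", "bonus"]

# One row per person, one column per feature; non-numeric columns are left out.
df = pd.DataFrame.from_dict(data_dict, orient="index")[my_features]

# Turn the 'NaN' strings into real missing values, then replace them with 0.
df = df.apply(pd.to_numeric, errors="coerce").fillna(0)

# Scale each column to [0, 1], keeping the original row and column labels.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df),
    index=df.index,
    columns=df.columns,
)

# Back to a dictionary of dictionaries: {person: {feature: scaled_value}}.
new_dict = scaled.to_dict(orient="index")
print(new_dict["PERSON B"])
```

Because the DataFrame index carries the original keys through the scaling step, no separate bookkeeping is needed to match rows back to people.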