Python: converting categorical variables with missing values into one-hot encoded variables (pandas)

Time: 2018-07-12 05:40:17

Tags: python pandas one-hot-encoding

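I have a dataset with categorical variables, and these variables have missing data. As I understand it, I want to impute the NaNs with the most frequent category.

As far as I know, scikit-learn has a OneHotEncoder, but it ends up dropping the column labels. We also have to encode the variables as numbers before one-hot encoding them.

I also saw that pandas has a get_dummies function that gives me dummy variables, and it can also create a variable for the NaNs. The problem is imputing the values.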
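To make the get_dummies behavior concrete, here is a minimal sketch. The `gender` Series is a hypothetical toy column, not taken from the real data. It only shows that `dummy_na=True` adds one extra indicator column whose label is NaN, which is the column that still needs to be folded into a real category.

```python
import pandas as pd

# Hypothetical toy column with one missing value
gender = pd.Series(["Female", "Male", None, "Male"], name="gender")

# dummy_na=True adds an extra indicator column whose label is NaN
dummies = pd.get_dummies(gender, dummy_na=True, dtype=int)

# Three columns: Female, Male and the NaN indicator
print(dummies.columns.tolist())
# Number of rows that fall into each column
print(dummies.sum().tolist())
```

The NaN column marks the rows with missing values, but nothing has been imputed yet.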

I created this:

import pandas as pd

def produce_dummies(X, label):
    # One-hot encode the column; dummy_na=True adds an indicator column for NaN
    dummies = pd.get_dummies(X[label], dummy_na=True, dtype=int)
    # Separate the NaN indicator from the real categories
    is_na = dummies.columns.isna()
    na_indicator = dummies.loc[:, is_na].sum(axis=1)
    dummies = dummies.loc[:, ~is_na].copy()
    # The most frequent category (the mode) absorbs the rows that were NaN
    mode_label = dummies.sum().idxmax()
    dummies[mode_label] += na_indicator
    # Replace the original column with its dummy columns
    X = pd.concat([X, dummies], axis=1)
    return X.drop(label, axis=1)

Is there a more efficient way to convert categorical variables with missing values into one-hot encoded variables?

Edit: Data.csv, which I use with the method above:

country,age,gender,salary
Poland,,Female,119459
Colombia,73,Male,109999
China,,Male,135392
Indonesia,91,Male,49886
Czech Republic,53,Male,103178
Canada,136,Female,131605
Indonesia,,Male,146128
China,,Female,119298
China,116,Female,123175
Vietnam,114,Male,138617
Palestinian Territory,32,Male,
Pakistan,45,Female,66317
China,140,Female,71870
Indonesia,,Female,
Poland,61,Male,69626
Mexico,149,Male,75329
Indonesia,53,Male,27189
Macedonia,36,Male,104105
Argentina,129,Female,
China,71,Male,74527
Philippines,70,Female,75104
China,133,Male,90145
South Korea,27,Male,92813
Portugal,97,Male,31438
Russia,63,Male,94148
Poland,62,Female,114636
Portugal,,Female,67986
Cuba,65,Female,92651
Indonesia,38,Female,107158
China,23,Female,61712
Saudi Arabia,,Male,58346
Argentina,66,Female,65748
Sweden,60,Male,84878
China,108,Male,73105
Russia,98,Male,
China,26,Female,63551
Brazil,75,Male,
Somalia,141,Female,95899
Thailand,34,Female,115984
China,111,Male,145321
China,29,Male,47681
Argentina,126,Female,75877
Russia,105,Female,90894
Indonesia,133,Male,86528
Philippines,22,Female,87861
Russia,28,Female,139846
Greece,56,Female,68124
Philippines,104,Female,113103
Belarus,101,Male,31092
Equatorial Guinea,141,Male,102130
Czech Republic,136,Female,68754
Indonesia,64,Male,36757
Costa Rica,41,Female,123774
France,83,Male,117321
Portugal,125,Male,20817
Italy,119,Male,134512
Sweden,78,Male,97482
China,146,Female,
China,89,Male,100798
France,36,Male,99792
,63,Female,90067
China,150,Male,
Liberia,67,Female,91082
Japan,,Male,145330
,,Female,96553
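For comparison, here is a minimal sketch of imputing before encoding, run on a handful of rows copied from the data above. The helper name `impute_and_encode` and the `prefix` choice are assumptions made for illustration. It also assumes that taking the first value returned by `Series.mode()` is acceptable when several categories tie.

```python
import io
import pandas as pd

# A few rows copied from Data.csv above; the last row has a missing country
csv_text = """country,age,gender,salary
Poland,,Female,119459
Colombia,73,Male,109999
China,,Male,135392
China,116,Female,123175
,63,Female,90067
"""
df = pd.read_csv(io.StringIO(csv_text))

def impute_and_encode(frame, label):
    # Hypothetical helper: fill NaN with the most frequent category,
    # then one-hot encode the filled column
    mode_value = frame[label].mode().iloc[0]  # first mode if there are ties
    filled = frame[label].fillna(mode_value)
    dummies = pd.get_dummies(filled, prefix=label, dtype=int)
    # Replace the original column with its dummy columns
    return pd.concat([frame.drop(columns=label), dummies], axis=1)

encoded = impute_and_encode(df, "country")
print(encoded.columns.tolist())
```

Because the NaNs are filled before encoding, no NaN indicator column is ever created, so there is nothing to merge or delete afterwards.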

0 answers:

No answers