Question

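I'm building a pipeline that starts with a DictVectorizer, which produces a sparse matrix. Specifying sparse=False changes the output from a scipy sparse matrix to a dense numpy matrix, which is good, but the next stage in the pipeline complains about NaN values, which is an obvious outcome of using the DictVectorizer in my case. I'd like the pipeline to treat missing dictionary values as zero rather than as not available.

As far as I can see, Imputer doesn't help me, because I want to "impute" with a constant value rather than with a statistic that depends on the other values in the column.

Here is the code I've been using: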

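# Import the submodules explicitly so that skl.<submodule> resolves
import sklearn.feature_extraction
import sklearn.feature_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn as skl

vectorize = skl.feature_extraction.DictVectorizer(sparse=False)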
variance = skl.feature_selection.VarianceThreshold()
knn = skl.neighbors.KNeighborsClassifier(4, weights='distance', p=1)

pipe = skl.pipeline.Pipeline([('vectorize', vectorize),
                            # here be dragons ('fillna', ),
                            ('variance', variance),
                            ('knn', knn)])
pipe.fit(dict_data, labels)

And some mock dictionaries:

dict_data = [{'city': 'Dubai', 'temperature': 33., 'assume_zero_when_missing': 7},
             {'city': 'London', 'temperature': 12.},
             {'city': 'San Fransisco', 'temperature': 18.}]

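Notice that in this example most of the dictionaries are missing assume_zero_when_missing, which causes later estimators to complain about NaN values:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The result I'm hoping for is that the NaN values are replaced with 0.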
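To see which columns trigger this error, one option is to count the NaNs per feature in the vectorized array. The following is only a minimal sketch: the two sample rows are hypothetical, and it assumes the NaNs reach the vectorizer as explicit float('nan') values in the dictionaries.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows; the NaN is written out explicitly for illustration.
rows = [
    {"city": "Dubai", "temperature": 33.0, "assume_zero_when_missing": 7},
    {"city": "London", "temperature": 12.0, "assume_zero_when_missing": float("nan")},
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(rows)

# Count NaNs per column and pair each count with its feature name.
nan_per_column = np.isnan(X).sum(axis=0).tolist()
nan_counts = dict(zip(vectorizer.feature_names_, nan_per_column))
print(nan_counts)
```

Any feature with a non-zero count is one that a later step such as VarianceThreshold would reject.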

Answer 1

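You can convert the list of dictionaries to a pandas DataFrame and then fill the NaNs with 0 using DF.fillna, as shown:

import pandas as pd

df = pd.DataFrame(dict_data)
df.fillna(0, inplace=True)

To use this as a step in the pipeline, you can write a custom class yourself that implements the fit and transform methods, as shown:

class FillingNans(object):
    '''Custom transformer for assembling into the pipeline object'''

    def transform(self, X):
        # DictVectorizer(sparse=False) hands over a NumPy array, which has no
        # fillna method, so wrap it in a DataFrame and return an array again.
        nans_replaced = pd.DataFrame(X).fillna(0).values
        return nans_replaced

    def fit(self, X, y=None):
        return self

Then you can add this manual step to the pipeline, as shown:

pipe = skl.pipeline.Pipeline([('vectorize', vectorize),
                             ('fill_nans', FillingNans()),
                             ('variance', variance),
                             ('knn', knn)])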

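In scikit-learn 0.20 and later, SimpleImputer accepts strategy="constant", which may make a custom class unnecessary. The following is a minimal sketch under that assumption and is not part of the original answer. The sample rows and labels are hypothetical, and the NaNs are written out explicitly so that the effect of the imputation step is visible.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical data with explicit NaN values standing in for "missing".
dict_data = [
    {"city": "Dubai", "temperature": 33.0, "assume_zero_when_missing": 7},
    {"city": "London", "temperature": 12.0, "assume_zero_when_missing": float("nan")},
    {"city": "San Francisco", "temperature": 18.0, "assume_zero_when_missing": float("nan")},
    {"city": "Dubai", "temperature": 35.0, "assume_zero_when_missing": 5},
]
labels = ["hot", "cold", "cold", "hot"]

pipe = Pipeline([
    ("vectorize", DictVectorizer(sparse=False)),
    # Replace every NaN with the constant 0 instead of a column statistic.
    ("fill_nans", SimpleImputer(strategy="constant", fill_value=0)),
    ("variance", VarianceThreshold()),
    ("knn", KNeighborsClassifier(4, weights="distance", p=1)),
])
pipe.fit(dict_data, labels)

# Inspect the array as it looks right after the imputation step.
vectorized = pipe.named_steps["vectorize"].transform(dict_data)
filled = pipe.named_steps["fill_nans"].transform(vectorized)
predictions = pipe.predict(dict_data)
```

Because the imputer is a regular scikit-learn estimator, it works with cloning, grid search, and pickling in the same way as the other pipeline steps.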

sklearn DictVectorizer(sparse=False) with a different default value, impute a constant
