ML.NET: build and train a model using a DataTable whose feature columns are known only at run time

Time: 2019-07-28 05:58:37

Tags: c# machine-learning multilabel-classification ml.net

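I am trying to write a C# wrapper method that makes it easier for me to create, train, and use ML.NET classification models without having to hard-code a class containing my predictor and target variables. I have looked at all the samples and searched the ML.NET documentation, but I could not find a complete example that goes from reading the data to using the model.

Below is the method I came up with. You will notice that the code for the variables "trainingDataView" and "dataProcessPipeline" is incomplete. This is the code I have spent the whole day trying to get working with various approaches, to no avail. During the cross-validation stage I keep getting an error telling me that the target column cannot be found.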

public static ITransformer CreateClassificationModelExample(MLContext mlContext, DataTable data, List<string> featureColumns, String targetColumn)
        {

            //I am stuck here. Ideally I would like to see a code snippet to create a IDataView from the DataTable passed in as parameter
            //and then selecting only the columns in parameter 'featureColumns' and target = parameter 'targetColumn'
            var trainingDataView = ????; 


            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey(targetColumn, targetColumn)
                                      .Append(mlContext.Transforms.Categorical.OneHotEncoding(ValToKeys))
                                      .Append(mlContext.Transforms.Concatenate("Features", featureSet))
                                      .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
                                      .AppendCacheCheckpoint(mlContext);


            // Set the training algorithm 
            var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: targetColumn, featureColumnName: "Features")
                                     .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

            var trainingPipeline = dataProcessPipeline.Append(trainer);

            // Evaluate quality of Model
            var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: targetColumn);

            // Train Model
            ITransformer model = trainingPipeline.Fit(trainingDataView);

            return model;
        }

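I have explored the ML.NET documentation thoroughly, including the LoadFromEnumerable method example. I have also looked at the ML.NET blog and the cookbook discussions on this topic.

Could someone please help with a code snippet that makes the above method work? I believe it would help many others too! Thank you!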

1 answer:

Answer 0 (score: 0)

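Well, after more than a day of effort I got close, although I have not completely eliminated compile-time changes. The code below shows a wrapper that more or less does what I wanted, although it does require the NUMBER of model features to be known at compile time. That is better, but far from ideal.

In the example below I create an IDataView from a DataTable, using only specific columns as predictors/features and one specific column as the target of the classification model. The code then builds a training pipeline for a classification model (the example uses the "LbfgsMaximumEntropy" trainer), evaluates it with cross-validation, and then trains it. I also show some code for creating a prediction engine and making a prediction. Note that this code assumes you have 10 predictor/feature variables. That 10 is easy to change (2 lines in the "Observation" class, shown below), which is much easier than writing a new class every time you want to make predictions with a new data table.

Here is the code. It is a bit old-fashioned because I do not use lambda expressions: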

public static ITransformer CreateClassificationModel(MLContext mlContext, DataTable data, List<string> predictorColumns, String TargetColumn, Dictionary<string, int> TargetMapper)
        {
            //Create instances of the GENERIC class Observation and set the values from the DataTable
            //using only the required predictor columns and the target column
            List<Observation> observations = new List<Observation>();
            foreach (DataRow row in data.Rows)
            {
                var obs = new Observation();

                // Copy the selected predictor columns, in order, into the fixed-length feature vector
                int iFeature = 0;
                foreach (string predictorColumn in predictorColumns)
                {
                    obs.Features[iFeature] = Convert.ToSingle(row[predictorColumn]);
                    iFeature++;
                }
                // Translate the target value into its integer class id using the caller-supplied map
                obs.Target = TargetMapper[row[TargetColumn].ToString()];
                observations.Add(obs);
            }

            var definedSchema = SchemaDefinition.Create(typeof(Observation));

            // Read the data into an IDataView with the modified schema supplied in
            IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(observations, definedSchema);

            var featureSet = new String[1];  
            featureSet[0] = "Features";

            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Target", "Target")
                                      .Append(mlContext.Transforms.Concatenate("Features", featureSet))
                                      .AppendCacheCheckpoint(mlContext);

            // Set the training algorithm 
            var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "Target", featureColumnName: "Features")
                                      .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
            IEstimator<ITransformer> trainingPipeline = dataProcessPipeline.Append(trainer);


            // Evaluate quality of Model
            var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: "Target");

            // Train Model
            ITransformer model = trainingPipeline.Fit(trainingDataView);


            return model;
        }
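
The answer does not show how the TargetMapper dictionary passed to the method is built. As a minimal sketch, assuming each distinct target value should simply receive a consecutive integer id in order of first appearance, it could be derived from the same DataTable. The class name TargetMapperBuilder and its Build method are hypothetical names introduced here for illustration; they are not part of ML.NET or of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TargetMapperBuilder
{
    // Hypothetical helper: assigns consecutive integer ids (0, 1, 2, ...) to the
    // distinct values of the target column, in order of first appearance.
    public static Dictionary<string, int> Build(DataTable data, string targetColumn)
    {
        var mapper = new Dictionary<string, int>();
        foreach (DataRow row in data.Rows)
        {
            string label = row[targetColumn].ToString();
            if (!mapper.ContainsKey(label))
            {
                // Count is read before the new entry is added, so ids start at 0
                mapper[label] = mapper.Count;
            }
        }
        return mapper;
    }
}
```

The resulting dictionary could then be passed as the TargetMapper argument of CreateClassificationModel. Keeping the same dictionary around lets you translate a predicted integer id back to the original label text later.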

To test/use this model, you can use the following PredictionEngine (code snippet):

List<Observation> testData = GetTestDataList();  // Get some test data as Observations

// Create a prediction engine from the model for feeding new data.
var engine = mlContext.Model.CreatePredictionEngine<Observation, ModelOutput>(model);

// Make a prediction. The result is of type ModelOutput, class shown below.
var output = engine.Predict(testData[0]);

Finally, here are the definitions of the two classes needed in the code above:

public class Observation
    {
        // Fixed-length feature vector; the array length must match the VectorType attribute
        [VectorType(10)]
        public float[] Features { get; } = new float[10];

        public int Target { get; set; }

    }

    public class ModelOutput
    {
        // ColumnName attribute is used to change the column name from
        // its default value, which is the name of the field.
        [ColumnName("PredictedLabel")]
        public Int32 Prediction { get; set; }
        public float[] Score { get; set; }
    }
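
One possible way to drop the remaining compile-time constant is to leave the [VectorType] attribute off the row class and set the vector length on a SchemaDefinition at run time before calling LoadFromEnumerable. The following is only a sketch under that assumption and has not been tested against the pipeline above; DynamicObservation and DynamicLoader are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: no [VectorType] attribute, so no length is fixed at compile time.
public class DynamicObservation
{
    public float[] Features { get; set; }
    public int Target { get; set; }
}

public static class DynamicLoader
{
    public static IDataView Load(MLContext mlContext, DataTable data,
        IReadOnlyList<string> predictorColumns, string targetColumn,
        IReadOnlyDictionary<string, int> targetMapper)
    {
        var rows = new List<DynamicObservation>();
        foreach (DataRow row in data.Rows)
        {
            // One float per requested predictor column, in the order given
            var features = new float[predictorColumns.Count];
            for (int i = 0; i < features.Length; i++)
            {
                features[i] = Convert.ToSingle(row[predictorColumns[i]]);
            }
            rows.Add(new DynamicObservation
            {
                Features = features,
                Target = targetMapper[row[targetColumn].ToString()]
            });
        }

        // Start from the schema inferred from the class, then override the
        // Features column with a vector type whose size is known only now.
        var schema = SchemaDefinition.Create(typeof(DynamicObservation));
        schema["Features"].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, predictorColumns.Count);

        return mlContext.Data.LoadFromEnumerable(rows, schema);
    }
}
```

If you go this route, the prediction engine would presumably need the same schema as well, which can be supplied through the optional inputSchemaDefinition parameter of CreatePredictionEngine.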