I have noticed that the Microsoft.Ml.Legacy.LearningPipeline.Row count in the SentimentAnalysis sample project is always 10, no matter how much data is in the test or training set.
https://github.com/dotnet/samples/blob/master/machine-learning/tutorials/SentimentAnalysis.sln
Can anyone explain the significance of the 10 here?
// LearningPipeline allows you to add steps in order to keep everything together
// during the learning process.
// <Snippet5>
var pipeline = new LearningPipeline();
// </Snippet5>
// The TextLoader loads a dataset with comments and corresponding positive or negative sentiment.
// When you create a loader, you specify the schema by passing a class to the loader containing
// all the column names and their types. This is used to create the model, and train it.
// <Snippet6>
pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
// </Snippet6>
// TextFeaturizer is a transform that is used to featurize an input column.
// This is used to format and clean the data.
// <Snippet7>
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
//</Snippet7>
// Adds a FastTreeBinaryClassifier, the decision tree learner for this project, and
// three hyperparameters to be used for tuning decision tree performance.
// <Snippet8>
pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 50, NumTrees = 50, MinDocumentsInLeafs = 20 });
// </Snippet8>
Answer 0 (score: 2)
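The debugger shows only a preview of the data: the first 10 rows. The goal is to show a few sample rows and how each transform operates on them, which makes debugging easier.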
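Reading the entire training set and running every transform over it is expensive, so that happens only when you reach .Train(). Because the preview applies the transforms to just a few rows, the result can differ from what you get on the full dataset (for example, the text dictionary will likely be larger). Still, seeing a preview before the full training run should help with debugging and with confirming that the transforms are applied to the correct columns.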
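To make the preview-versus-full-pass distinction concrete, here is a minimal sketch in plain C# with LINQ. It does not use ML.NET and is not how the library is implemented; `LoadRows`, `Transform`, and `RowsRead` are hypothetical stand-ins for a loader, a transform step, and a read counter, assumed purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PreviewDemo
{
    // Counts how many source rows were actually pulled from the "loader".
    public static int RowsRead;

    // Hypothetical stand-in for a data loader: yields rows lazily, one at a time.
    public static IEnumerable<string> LoadRows(int total)
    {
        for (int i = 0; i < total; i++)
        {
            RowsRead++;
            yield return $"comment {i}";
        }
    }

    // Hypothetical stand-in for a transform step (here, trivial text clean-up).
    public static IEnumerable<string> Transform(IEnumerable<string> rows) =>
        rows.Select(r => r.ToUpperInvariant());

    public static void Main()
    {
        // Building the pipeline reads nothing; it only describes the work.
        var pipeline = Transform(LoadRows(1000));

        // Preview: pull only the first 10 rows through the transform.
        var preview = pipeline.Take(10).ToList();
        Console.WriteLine($"preview rows: {preview.Count}, rows read: {RowsRead}");

        // Full pass over the data, analogous to what happens at .Train().
        RowsRead = 0;
        var all = pipeline.ToList();
        Console.WriteLine($"all rows: {all.Count}, rows read: {RowsRead}");
    }
}
```

The preview touches only as many rows as it asks for, while the full pass reads everything. That is why the row count you see while debugging stays at 10 regardless of the size of the input file.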
If you have any ideas on how to make this clearer or more useful, it would be great if you could create an issue on GitHub!