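I have a problem with my ML.NET console application. This is my first time using ML.NET in Visual Studio, so I followed this tutorial on microsoft.com, which covers sentiment analysis using binary classification.
I'm trying to process some test data in the form of TSV files to get a positive or negative sentiment result, but while debugging I get a warning that reports 1 format error and 2 bad values.
I decided to ask all you great developers here on Stack to see if anyone can help me find a solution.
Here is a screenshot of the debug output:
Here are the links to my test data: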
wiki-data
wiki-test-data
And finally, here is my code for anyone who wants to reproduce the problem:
There are two C# files: SentimentData.cs and Program.cs.
1 - SentimentData.cs:
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.ML.Runtime.Api;
namespace MachineLearningTut
{
    public class SentimentData
    {
        [Column(ordinal: "0")]
        public string SentimentText;
        [Column(ordinal: "1", name: "Label")]
        public float Sentiment;
    }
    public class SentimentPrediction
    {
        [ColumnName("PredictedLabel")]
        public bool Sentiment;
    }
}
2 - Program.cs:
using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Threading.Tasks;
namespace MachineLearningTut
{
    class Program
    {
        const string _dataPath = @".\Data\wikipedia-detox-250-line-data.tsv";
        const string _testDataPath = @".\Data\wikipedia-detox-250-line-test.tsv";
        const string _modelpath = @".\Data\Model.zip";
        static async Task Main(string[] args)
        {
            var model = await TrainAsync();
            Evaluate(model);
            Predict(model);
        }
        public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
        {
            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
            pipeline.Add(new FastForestBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
            PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
            await model.WriteAsync(path: _modelpath);
            return model;
        }
        public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            var testData = new TextLoader(_testDataPath).CreateFrom<SentimentData>();
            var evaluator = new BinaryClassificationEvaluator();
            BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("-------------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        }
        public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            IEnumerable<SentimentData> sentiments = new[]
            {
                new SentimentData
                {
                    SentimentText = "Please refrain from adding nonsense to Wikipedia."
                },
                new SentimentData
                {
                    SentimentText = "He is the best, and the article should say that."
                }
            };
            IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
            Console.WriteLine();
            Console.WriteLine("Sentiment Predictions");
            Console.WriteLine("---------------------");
            var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
            }
            Console.WriteLine();
        }
    }
}
If anyone wants to see the solution's code or more details, ask me in chat and I'll send it over. Thanks in advance!
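Answer 0 (score: 1)
I think I've got a fix. There are a couple of things to update:
First, I think the SentimentData properties are switched relative to what the data actually contains: the label is in the first column and the text in the second. Try changing the class to
[Column(ordinal: "0", name: "Label")]
public float Sentiment;
[Column(ordinal: "1")]
public string SentimentText;
Second, use the useHeader parameter of the TextLoader.CreateFrom method. Don't forget to add it to the other loader as well, the one that reads the validation data.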
pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>(useHeader: true));
With these two updates I got the following output. Looks like a nice model with an AUC of 85%!
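To see why both changes matter, here is a minimal sketch in plain C# (no ML.NET involved) that counts how many rows would produce an unparseable label. The sample rows, the `LabelCheck` class and the `CountBadLabels` method are hypothetical names made up for illustration, and the real loader's parsing rules may well differ.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class LabelCheck
{
    // Counts rows whose label column cannot be parsed as a float.
    // labelColumn: zero-based index of the label field.
    // hasHeader: when true, the first line is skipped.
    public static int CountBadLabels(string[] lines, int labelColumn, bool hasHeader)
    {
        return lines
            .Skip(hasHeader ? 1 : 0)
            .Select(line => line.Split('\t'))
            .Count(fields => labelColumn >= fields.Length ||
                !float.TryParse(fields[labelColumn], NumberStyles.Float,
                                CultureInfo.InvariantCulture, out _));
    }

    public static void Main()
    {
        // Hypothetical rows laid out as this answer assumes: label first, text second.
        string[] lines =
        {
            "Sentiment\tSentimentText",
            "1\tThis is a made-up comment.",
            "0\tThis is another made-up comment."
        };

        // Label read from column 1 without skipping the header: every row fails.
        Console.WriteLine(CountBadLabels(lines, labelColumn: 1, hasHeader: false));
        // Label read from column 0 with the header skipped: no row fails.
        Console.WriteLine(CountBadLabels(lines, labelColumn: 0, hasHeader: true));
    }
}
```

The header line is text, so it can never parse as a number; that is presumably the kind of row the loader reports as a bad value when the header is not skipped.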
Answer 1 (score: 0)
Another thing that helps with text datasets is telling the loader that the text may be quoted:
TextLoader("someFile.txt").CreateFrom<Input>(useHeader: true, allowQuotedStrings: true)
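As a rough illustration of what honoring quotes changes, the sketch below splits one line on tabs with and without treating double quotes as field delimiters. It is not ML.NET's parser; `QuotedSplit`, `SplitLine` and the sample row are hypothetical, and it simply drops quote characters rather than handling escaped quotes properly.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedSplit
{
    // Splits one line on tabs. When honorQuotes is true, tabs inside
    // double-quoted sections stay part of the field.
    public static List<string> SplitLine(string line, bool honorQuotes)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (honorQuotes && c == '"')
            {
                inQuotes = !inQuotes;   // toggle state; the quote itself is dropped
            }
            else if (c == '\t' && !inQuotes)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }

    public static void Main()
    {
        // Hypothetical row whose text field contains a tab inside quotes.
        string line = "1\t\"first part\tsecond part\"";

        Console.WriteLine(SplitLine(line, honorQuotes: false).Count); // 3 fields
        Console.WriteLine(SplitLine(line, honorQuotes: true).Count);  // 2 fields
    }
}
```

Without quote handling the row appears to have one field too many, which is the sort of mismatch that can surface as a format error.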
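Answer 2 (score: -1)
Lines 252 and 253 contain incorrectly formatted values. My guess is that some fields there contain the delimiter character. If you post the code or sample data, we can be more precise.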
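One way to check that guess is to scan the file for lines whose field count is off. The sketch below does this in plain C# over in-memory sample rows; `TsvScan`, `FindMalformedLines` and the rows themselves are hypothetical, and the expected count of 2 fields assumes the label-plus-text layout discussed above.

```csharp
using System;
using System.Collections.Generic;

public static class TsvScan
{
    // Returns the 1-based numbers of lines whose tab-separated field count
    // differs from expectedFields.
    public static List<int> FindMalformedLines(IEnumerable<string> lines, int expectedFields)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (line.Split('\t').Length != expectedFields)
            {
                bad.Add(lineNumber);
            }
        }
        return bad;
    }

    public static void Main()
    {
        // Hypothetical rows: line 3 has a stray tab, line 4 is missing its text.
        string[] lines =
        {
            "Sentiment\tSentimentText",
            "1\tfine row",
            "0\tstray\ttab here",
            "1"
        };
        Console.WriteLine(string.Join(", ", FindMalformedLines(lines, expectedFields: 2)));
    }
}
```

To run it against a real file, pass `System.IO.File.ReadLines(path)` instead of the sample array; the reported numbers can then be compared with the line numbers mentioned in the warning.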