Question

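For an online shop, I want to build a model that makes suggestions based on what is on someone's wishlist: a "someone has X on their wishlist, so we also suggest Y" scenario. The problem is that the trainers don't work, either because proper labels are missing (my dataset has none) or because there simply isn't enough data. This results in float.NaN scores or inaccurate prediction scores (all or most of the scores end up like this).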

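I have all existing wishlists at my disposal, with their corresponding ProfileId and ItemId (both integers). These are grouped into ProfileID-ItemID combinations (each one represents a single item on a wishlist, so a user with 3 items has 3 combinations). In total I have roughly 150,000 combinations to work with, for 16,000 users and 50,000 items. Items that appear on only a single wishlist (or on none at all), and users with only one item on their wishlist, are excluded from the training data (the numbers above are already filtered). If needed, I can add extra data columns for the category an item belongs to (toys, books, etc.), its price, and other metadata.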

What I do not have is ratings, because the online shop doesn't use them. So I can't use them as the "label".

public class WishlistItem
{
    // these variables are either uint32 or a Single (float) based on the training algorithm.
    public uint ProfileId;
    public uint ItemId; 
    public float Label;
}

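What I expect would solve the problem:

One of the following three, or a combination of them:

1) I need to use a different trainer. If so, which one would be best suited?

2) I need to insert different values for the Label variable. If so, how should they be generated?

3) I need to generate a different "fake" dataset to pad the training data. If so, how should it be generated?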

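Clarification of the problem and failed attempts to solve it

I tried parsing the data with different trainers to see which works best for my dataset: FieldAwareFactorizationMachine, MatrixFactorizationMachine and OLSTrainer. I also tried using MatrixFactorizationMachine with LossFunctionType.SquareLossOneClass, where combinations of ItemIds from the same wishlist are inserted instead of ProfileID-ItemID combinations (e.g. item1-item2, item2-item3, item1-item3 from one wishlist).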

These trainers were set up based on information found in the following tutorials:

FieldAware: https://xamlbrewer.wordpress.com/2019/04/23/machine-learning-with-ml-net-in-uwp-field-aware-factorization-machine/
MatrixFactorization: https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/movie-recommendation
MatrixFactorization (OneClass): https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
OLS: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet

Below is an example of one of the pipelines; the others are very similar:

string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";

// the Matrix Factorization pipeline
var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = profileEncoded,
    MatrixRowIndexColumnName = itemEncoded,
    LabelColumnName = nameof(WishlistItem.Label),
    NumberOfIterations = 100,
    ApproximationRank = 100
};

// Map the raw ids to key types, then train on the encoded columns using the options above.
trainerEstimator = Context.Transforms.Conversion.MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
    .Append(Context.Transforms.Conversion.MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
    .Append(Context.Recommendation().Trainers.MatrixFactorization(options));

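To mitigate the missing labels, I tried several workarounds:

Leaving them blank (a float value of 0f)
Using the hash code of the itemid, the profileid, or a combination of both
Counting the number of entries that contain a particular itemid or profileid, optionally manipulating that number to produce less extreme values in case an item is represented hundreds of times (using a square root or log function, giving Label = Math.Log(amountoftimes); or Label = Math.Ceiling(Math.Log(amountoftimes));)
For the FieldAware machine, where Label is a boolean rather than a float, the calculation above is used to determine whether the float result is above or below the average over all items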
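
The count-based label from the list above could be computed roughly as follows. This is only a sketch of that workaround: the class name `LabelSketch`, the method `LogCountLabels`, and the tuple shape are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LabelSketch
{
    // Hypothetical helper: takes one (ProfileId, ItemId) pair per wishlist entry
    // and returns, per item id, the natural log of how many entries contain it.
    public static Dictionary<uint, float> LogCountLabels(
        IEnumerable<(uint ProfileId, uint ItemId)> entries)
    {
        return entries
            .GroupBy(e => e.ItemId)
            .ToDictionary(
                g => g.Key,
                g => (float)Math.Log(g.Count()));
    }
}
```

Note that a label built this way depends only on the item and not on the profile-item pair, so it mostly encodes item popularity. That may be one reason the recommendations come out the same for every input.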

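For testing, I used the following two approaches to determine which suggestions "Y" can be made for item "X":

Comparing ItemID X against all existing items, together with the user's ProfileID.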


// GiftId is the item id; build one candidate row per distinct item for this user.
List<WishlistItem> predictionsForUser = profileMatrix
    .DistinctBy(x => x.GiftId)
    .Select(x => new WishlistItem(userId, x.GiftId, x.Label))
    .ToList();

IDataView transformed = trainedModel.Transform(Context.Data.LoadFromEnumerable(predictionsForUser));

CoPurchasePrediction[] predictions = Context.Data.CreateEnumerable<CoPurchasePrediction>(transformed, false).ToArray();

// Pair each candidate with its prediction and keep the 10 highest scores.
IEnumerable<KeyValuePair<WishlistItem, CoPurchasePrediction>> results = Enumerable.Range(0, predictions.Length)
    .ToDictionary(x => predictionsForUser[x], x => predictions[x])
    .OrderByDescending(x => x.Value.Score)
    .Take(10);

return results.Select(x => x.Key.GiftId.ToString()).ToArray();

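Comparing ItemID X against the items on other people's wishlists that also contain X. This one is used for the FieldAware factorization trainer, which takes a boolean as its label.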

public IEnumerable<WishlistItem> CreatePredictDataForUser(string userId, IEnumerable<WishlistItem> userItems)
{
    // Index all known wishlist entries by item and by profile.
    Dictionary<string, IEnumerable<WishlistItem>> giftIdGroups = profileMatrix.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => x.Select(y => y));
    Dictionary<string, IEnumerable<WishlistItem>> profileIdGroup = profileMatrix.GroupBy(x => x.ProfileId).ToDictionary(x => x.Key, x => x.Select(y => y));
    profileIdGroup.Add(userId, userItems);

    List<WishlistItem> results = new List<WishlistItem>();

    // For every item of the user, collect the full wishlists of all profiles that also contain it.
    foreach (WishlistItem wi in userItems)
    {
        IEnumerable<WishlistItem> giftIdGroup = giftIdGroups[wi.GiftId];
        foreach (WishlistItem subwi in giftIdGroup)
        {
            results.AddRange(profileIdGroup[subwi.ProfileId]);
        }
    }

    // Drop the items the user already has.
    IEnumerable<WishlistItem> filtered = results.ExceptBy(userItems, x => x.GiftId);

    // get duplicates
    Dictionary<string, float> duplicates = filtered.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => giftLabelValues[x.First().GiftId]);
    float max = duplicates.Values.Max();

    // The boolean label is true when the item's value is more than half of the maximum.
    return filtered.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, duplicates[x.GiftId] * 2 > max));
}

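However, regardless of which item is inserted, the test results are either entirely or partly unusable (float.NaN), or the recommendations are always the same (for item X we suggest Y and Z).

When evaluating the data with a test data view (DataOperationsCatalog.TrainTestData split = Context.Data.TrainTestSplit(data, 0.2)), it shows either promising results with high accuracy or random values all over the place, and neither matches the results I actually get; high accuracy still ends in float.NaN or "always the same" recommendations.

It has been pointed out online that float.NaN can be the result of a small dataset. To compensate, I tried creating a "fake" dataset: randomly generated profile-item combinations based on existing profileids and itemids, labelled 0f or false, while the real ones are labelled above 0f or true. (These random "negative" entries are checked beforehand so that none of them accidentally coincides with a "real" combination.) However, this had little to no effect.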
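
The "fake" negative generation described above can be sketched like this. It is a minimal illustration under assumptions: `NegativeSamplingSketch`, `Sample`, and the fixed seed are hypothetical names and choices, not the original implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NegativeSamplingSketch
{
    // Draws up to `count` distinct random (profile, item) pairs, built from
    // ids that occur in `positives`, that are NOT themselves in `positives`.
    public static List<(uint ProfileId, uint ItemId)> Sample(
        ISet<(uint ProfileId, uint ItemId)> positives, int count, int seed = 42)
    {
        var rng = new Random(seed);
        uint[] profiles = positives.Select(p => p.ProfileId).Distinct().ToArray();
        uint[] items = positives.Select(p => p.ItemId).Distinct().ToArray();

        // Cap the request at the number of unobserved pairs that exist at all,
        // otherwise the rejection loop below could never finish.
        long available = (long)profiles.Length * items.Length - positives.Count;
        int target = (int)Math.Min(count, available);

        var negatives = new HashSet<(uint ProfileId, uint ItemId)>();
        while (negatives.Count < target)
        {
            var candidate = (profiles[rng.Next(profiles.Length)], items[rng.Next(items.Length)]);
            // Reject pairs that are real wishlist entries.
            if (!positives.Contains(candidate)) negatives.Add(candidate);
        }
        return negatives.ToList();
    }
}
```

The sampled pairs would then be given the label 0f (or false) and the real pairs a positive label, as described above. This only reproduces the procedure; it says nothing about whether such negatives improve the model.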

Answer 1

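I don't think any of the solutions you've tried will work because, as you point out, you don't have any label data. Faking the label data won't work either, because the ML algorithm will simply learn from those fake labels.

I believe what you are looking for is a one-class matrix factorization algorithm.

Your "label" or "score" is implicit: the fact that an item is on a user's wishlist is itself the label, indicating that the user is interested in that item. One-class matrix factorization makes use of this kind of implicit labeling.

Have a read through this article: https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25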
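
As a rough illustration of that idea in ML.NET, the sketch below trains a one-class matrix factorization model in which every wishlist entry gets the implicit label 1. It assumes the Microsoft.ML and Microsoft.ML.Recommender packages; the class names (`WishlistEntry`, `ScoreOutput`, `OneClassSketch`) and all hyperparameter values are hypothetical and untuned.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Trainers;

public class WishlistEntry
{
    public uint ProfileId;
    public uint ItemId;
    public float Label;   // implicit label: 1 means "this item is on this wishlist"
}

public class ScoreOutput
{
    public float Score;
}

public static class OneClassSketch
{
    public static ITransformer Train(MLContext ml, IEnumerable<WishlistEntry> entries)
    {
        IDataView data = ml.Data.LoadFromEnumerable(entries);

        var options = new MatrixFactorizationTrainer.Options
        {
            MatrixColumnIndexColumnName = "ProfileIdEncoded",
            MatrixRowIndexColumnName = "ItemIdEncoded",
            LabelColumnName = nameof(WishlistEntry.Label),
            // One-class loss: only observed (positive) pairs need to be supplied.
            LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
            // Illustrative values only, not tuned for any dataset.
            Alpha = 0.01,
            Lambda = 0.025,
            NumberOfIterations = 50,
            ApproximationRank = 32
        };

        // The trainer expects key-typed index columns, hence MapValueToKey.
        var pipeline = ml.Transforms.Conversion
            .MapValueToKey("ProfileIdEncoded", nameof(WishlistEntry.ProfileId))
            .Append(ml.Transforms.Conversion.MapValueToKey("ItemIdEncoded", nameof(WishlistEntry.ItemId)))
            .Append(ml.Recommendation().Trainers.MatrixFactorization(options));

        return pipeline.Fit(data);
    }

    public static float Score(MLContext ml, ITransformer model, uint profileId, uint itemId)
    {
        // Assumption worth verifying: ids never seen in training map to a missing
        // key, and the model may then return NaN for that pair.
        var engine = ml.Model.CreatePredictionEngine<WishlistEntry, ScoreOutput>(model);
        return engine.Predict(new WishlistEntry { ProfileId = profileId, ItemId = itemId }).Score;
    }
}
```

To recommend for a given profile, one would score that profile against every known item it does not already have and keep the highest scores. As far as I understand, pairs involving a profile or item absent from the training data cannot be scored meaningfully, so candidates should be limited to known ids.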

Answer 2

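What you are looking for is a classic recommender-system solution. Recommender systems are used to dealing with missing and sparse data. There are many ways to approach this problem; I suggest starting with this article. In general, there are two approaches in recommender systems: model-based and memory-based. In my experience, model-based approaches perform much better than memory-based ones. There is a nice summary of the different models and solutions here. Take a look at the matrix factorization solution proposed by Koren and Bell here, which works very well in many cases.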
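
For orientation, the basic matrix factorization model usually associated with Koren and Bell can be written as below. The notation is the commonly used one and is an assumption here, not taken from this thread: $p_u$ and $q_i$ are the latent vectors of user $u$ and item $i$, $r_{ui}$ is the known value for the pair, $\mathcal{K}$ is the set of known pairs, and $\lambda$ is a regularization weight.

```latex
\hat{r}_{ui} = q_i^{\top} p_u,
\qquad
\min_{p_*,\, q_*} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - q_i^{\top} p_u \right)^2
  + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

In the wishlist setting there are no explicit ratings, so $r_{ui}$ would simply be 1 for every observed profile-item pair.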

(ML.NET) How to train on a dataset that contains no labels
