Question

I have a dataset with Name, ratings, ratings_count, and Genre columns.

For example: Movies_Data.csv

    Name            ratings  ratings_count  Action  Adventure  Horror  Musical  Thriller
    Mad-Max         2        7              1       0          0       0        1
    Mitchell[1975]  3.25     2              1       0          0       0        1
    John Wick       4.23     4              1       0          0       0        0
    Insidious       3.75     10             0       0          1       0        0

I split it into features and labels, then applied label encoding to the Name column.

This is my features dataset after the split.

Features:

    ratings  ratings_count  Action  Adventure  Horror  Musical  Thriller
    2        7              1       0          0       0        1
    3.25     2              1       0          0       0        1
    4.23     4              1       0          0       0        0
    3.75     10             0       0          1       0        0

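Now the problem is that I have 18 'Genre' columns. So I think my decision tree gives more importance to those columns than to ratings and ratings_count.

For example, if I ask the tree to predict a movie with the following parameters: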

ratings:3 ratings_count:2 Action:1 Adventure:0 Horror:0 Musical:0 Thriller:1

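Since ratings: 3 is close to 3.25 and ratings_count is the same as in my input, it should obviously predict Mitchell[1975]. But it is predicting Mad-Max. How can I increase the importance of the ratings and ratings_count columns?

I am new to ML. So, is there any other approach or algorithm that would be a better fit for my case?

P.S. I know we could use a neural network, but I have to stick to basic ML algorithms.

Thanks!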

Answer 1

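First, random forests almost always give better results than a single decision tree. They have more hyperparameters to tune, but that can also help you get better results. A random forest is an ensemble algorithm, and it works well because it averages the predictions of many decision trees. It overfits less than a single tree and should therefore perform better.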
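As a rough illustration of the suggestion above, here is a minimal sketch that fits a random forest on the four sample rows from the question and prints how much weight each column receives. It assumes scikit-learn is installed; the values of `n_estimators` and `random_state` are arbitrary choices for the example, not recommendations. With only one row per movie, the result is not meaningful beyond showing the mechanics.

```python
from sklearn.ensemble import RandomForestClassifier

# Sample rows copied from the question (only 5 of the 18 genre columns are shown there).
feature_names = ["ratings", "ratings_count", "Action", "Adventure",
                 "Horror", "Musical", "Thriller"]
X = [
    [2.0, 7, 1, 0, 0, 0, 1],
    [3.25, 2, 1, 0, 0, 0, 1],
    [4.23, 4, 1, 0, 0, 0, 0],
    [3.75, 10, 0, 0, 1, 0, 0],
]
y = ["Mad-Max", "Mitchell[1975]", "John Wick", "Insidious"]

# n_estimators and random_state are assumptions made for this sketch.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# The query from the question: ratings 3, ratings_count 2, Action and Thriller set.
query = [[3, 2, 1, 0, 0, 0, 1]]
prediction = forest.predict(query)[0]
print("Predicted movie:", prediction)

# feature_importances_ shows how much each column contributed to the splits.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Inspecting `feature_importances_` is a quick way to check whether the genre columns really dominate before trying to change anything.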

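If you still have problems, you could try merging some of the genre categories (or getting more data) so that your algorithm can correctly infer the importance of the ratings.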
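Merging categories could look something like the sketch below, which collapses several genre columns into a smaller number of grouped columns. The grouping itself (`Action_like` and so on) is purely hypothetical and chosen only for illustration; which genres belong together depends on your data.

```python
# Column layout and sample rows copied from the question.
feature_names = ["ratings", "ratings_count", "Action", "Adventure",
                 "Horror", "Musical", "Thriller"]
rows = [
    [2.0, 7, 1, 0, 0, 0, 1],
    [3.25, 2, 1, 0, 0, 0, 1],
    [4.23, 4, 1, 0, 0, 0, 0],
    [3.75, 10, 0, 0, 1, 0, 0],
]

# Hypothetical grouping: each new column is 1 if any of its member genres is 1.
groups = {
    "Action_like": ["Action", "Adventure", "Thriller"],
    "Horror": ["Horror"],
    "Musical": ["Musical"],
}

def merge_genres(row):
    """Return the row with its genre columns collapsed into the groups above."""
    record = dict(zip(feature_names, row))
    merged = [record["ratings"], record["ratings_count"]]
    for members in groups.values():
        merged.append(int(any(record[m] for m in members)))
    return merged

merged_names = ["ratings", "ratings_count"] + list(groups)
merged_rows = [merge_genres(row) for row in rows]

print(merged_names)
for row in merged_rows:
    print(row)
```

With fewer genre columns, the two numeric columns make up a larger share of the candidate splits the tree can choose from.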

Also, this question may be better suited to Cross Validated, where you can ask more theoretical questions.

Good luck!

How to increase the importance of a column in a decision tree?
