Question

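I am trying to understand the KNN algorithm example in R from Datacamp's Machine Learning in R for beginners.

I am having trouble understanding how they perform the sampling to build the training and test datasets.

I can follow this code: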

ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))

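My understanding is that this creates a vector of length nrow(iris) whose values are either 1 or 2, chosen with probabilities 0.67 and 0.33 respectively.

So we get the following output:

> ind
  [1] 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
 [58] 1 1 1 1 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1
[115] 1 2 1 1 1 2 1 2 1 1 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1

In the next step, they create the training set with the following code:

iris.training <- iris[ind==1, 1:4]

This line presumably produces a data frame containing all the rows for which ind == 1:

head(iris.training)
   Sepal.Length Sepal.Width Petal.Length Petal.Width
1           5.1         3.5          1.4         0.2
2           4.9         3.0          1.4         0.2
4           4.6         3.1          1.5         0.2
7           4.6         3.4          1.4         0.3
9           4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1

My question is how the variable ind and the iris dataset are related. That is, how does R know which rows to pull from the original iris dataset (which rows have ind == 1), given that there is no explicit association between the iris and ind datasets? The only mention of the iris dataset when setting up ind is the use of nrow(iris) to determine the sample size (the number of values to draw).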

Answer 1

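My question is how the variable ind and the iris dataset are related.

They are not related, and they do not need to be. For example, there is no inherent connection between the numbers 1-5 and the iris dataset, yet they can still be used to select rows: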

iris[1:5, ]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa

Created on 2018-08-05 by the reprex package (v0.2.0).
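
To make the positional matching concrete, here is a minimal sketch of what happens when ind == 1 is used as a row index. The set.seed(1234) call and the names keep and rows are assumptions added for illustration and do not come from the tutorial, so the exact pattern of 1s and 2s will differ from the output in the question.

```r
# Assumption: an arbitrary seed, only so the sketch is reproducible.
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

# ind == 1 is a logical vector with one element per row of iris.
# R matches element i of this vector to row i of iris purely by position.
keep <- ind == 1
length(keep) == nrow(iris)

# which() turns the TRUE positions into explicit row numbers.
rows <- which(keep)

# The row names of the subset are the original row numbers,
# which is why head(iris.training) shows gaps such as 1, 2, 4, 7, ...
identical(rownames(iris[keep, 1:4]), as.character(rows))
```

The only thing ind and iris share is their length: ind has one entry per row, and the subsetting operator pairs them up by position.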

It may be clearer to write ind <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.67, 0.33)) and then iris[ind, ], which emphasizes that ind is an index of the rows to select.
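
A minimal sketch of that logical-index variant, extended to build both sets, might look like the following. The seed and the name iris.test are assumptions made for illustration; only iris.training appears in the question. Because each row is assigned independently, the split is only approximately 67/33.

```r
# Assumption: an arbitrary seed, only so the sketch is reproducible.
set.seed(1234)

# One TRUE/FALSE draw per row of iris; TRUE means "goes into training".
ind <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

# Rows where ind is TRUE form the training set.
iris.training <- iris[ind, 1:4]

# Hypothetical name: the remaining rows (ind is FALSE) form the test set.
iris.test <- iris[!ind, 1:4]

# Every row lands in exactly one of the two sets.
nrow(iris.training) + nrow(iris.test) == nrow(iris)
```

Using !ind for the complement guarantees the two sets do not overlap, which is the same effect the tutorial gets with ind == 1 and ind == 2.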

Clarification about using the sample() function in R to set up training and test sets for ML.
