Question

Is there an R package that can generate a random dataset given a pre-existing template dataset?

For example, say I have the iris dataset:

data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

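I would like a function random_df(iris) that generates a data frame with the same columns as iris but with random data (preferably random data that preserves certain statistical properties of the original, for example those of the numeric variables).

What is the easiest way to do this?

[Comment from the question author moved here. -Ed.]

I don't want to sample random rows from the existing dataset. I want to generate actual random data with the same columns (and types) as the existing dataset. Ideally, if there is some way to preserve the statistical properties of the numeric variables, that would be desirable, but it is not required.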

Answer 1

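How about this for a start?

Define a function that simulates data from df by

drawing samples for the numeric columns of df from a normal distribution with the same mean and sd as the original column, and
drawing samples uniformly from the levels of the factor columns.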

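generate_data <- function(df, nrow = 10) {
    as.data.frame(lapply(df, function(x) {
        if (is.numeric(x)) {
            # Normal draw matching the column's mean and sd
            rnorm(nrow, mean = mean(x), sd = sd(x))
        } else if (is.factor(x)) {
            # Uniform draw over the levels; keep the result a factor
            factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
        } else {
            # Fail loudly instead of silently losing the column
            stop("unsupported column class: ", class(x)[1])
        }
    }))
}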

For example, applied to iris, we get

set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame':  10 obs. of  5 variables:
# $ Sepal.Length: num  6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num  2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num  4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num  0.487 1.68 1.779 0.809 1.963 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3

Extending the generate_data function to account for other column types should be straightforward.
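
As a rough illustration of such an extension, here is a minimal sketch. The name generate_data_ext and the rule chosen for each extra type (rounded normal draws for integers, Bernoulli draws for logicals, uniform draws over distinct values for character columns, uniform draws over the observed range for dates) are assumptions made for this example, not part of the original answer; other choices may suit your data better.

```r
# Hypothetical extension of generate_data(); the per-type rules below are
# illustrative assumptions, not the only reasonable choices.
generate_data_ext <- function(df, nrow = 10) {
    simulate_column <- function(x) {
        if (is.factor(x)) {
            # Uniform draw over the levels, keeping the full level set
            factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
        } else if (inherits(x, "Date")) {
            # Uniform draw over every day in the observed date range
            rng <- range(x, na.rm = TRUE)
            days <- seq(rng[1], rng[2], by = "day")
            days[sample.int(length(days), nrow, replace = TRUE)]
        } else if (is.logical(x)) {
            # Bernoulli draw with the observed proportion of TRUE
            runif(nrow) < mean(x, na.rm = TRUE)
        } else if (is.integer(x)) {
            # Rounded normal draw so the column stays integer
            as.integer(round(rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
        } else if (is.numeric(x)) {
            # Normal draw matching the column's mean and sd
            rnorm(nrow, mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
        } else if (is.character(x)) {
            # Uniform draw over the distinct observed strings
            vals <- unique(x)
            vals[sample.int(length(vals), nrow, replace = TRUE)]
        } else {
            stop("unsupported column class: ", class(x)[1])
        }
    }
    as.data.frame(lapply(df, simulate_column), stringsAsFactors = FALSE)
}
```

Calling str(generate_data_ext(iris)) should give a data frame with the same column names and classes as iris. Note that each column is still simulated independently, so correlations between columns are not preserved.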
