Question

Suppose I have the following df:

  international_plan voice_mail_plan number_vmail_messages
1                 no             yes                    25
2                 no             yes                    26
3                 no              no                     0
4                yes              no                     0
5                yes              no                     0
6                yes              no                     0
-
1000000          no               yes                    20


  total_day_minutes total_day_calls total_day_charge total_eve_minutes
1             265.1             110            45.07             197.4
2             161.6             123            27.47             195.5
3             243.4             114            41.38             121.2
4             299.4              71            50.90              61.9
5             166.7             113            28.34             148.3
6             223.4              98            37.98             220.6
  total_eve_calls total_eve_charge total_night_minutes total_night_calls
1              99            16.78               244.7                91
2             103            16.62               254.4               103
3             110            10.30               162.6               104
4              88             5.26               196.9                89
5             122            12.61               186.9               121
6             101            18.75               203.9               118
-          
1000000       50             20.22               189.23               100

  total_night_charge total_intl_minutes total_intl_calls total_intl_charge
1              11.01               10.0                3              2.70
2              11.45               13.7                3              3.70
3               7.32               12.2                5              3.29
4               8.86                6.6                7              1.78
5               8.41               10.1                3              2.73
6               9.18                6.3                6              1.70
-          
1000000         10.23               7.33               8              2.52

 number_customer_service_calls churn
1                             1    no
2                             1    no
3                             0    no
4                             2    no
5                             3    no
6                             0    no
-          
1000000                       2    yes

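I would like to try out the rsparkling + h2o framework on "larger" data, to deepen my understanding of how to handle big data on a local machine.

It would help if I could expand my existing small dataset rather than download big data from the web, so that I don't waste time on preprocessing and can concentrate on ML modelling at scale.

What I'm looking for is a way to randomly add data (i.e. rows) based only on the existing data (keeping the same columns), e.g. following some distribution for the numeric columns (normal dist) and for the categorical columns (maintaining the proportional frequency of the levels), so that I go from the initial 3333 x 17 to a dimension of 1000000 x 17 using R. This is for testing purposes only.

Any help is much appreciated.

Expected df: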


Answer 1

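A quick function built from simple if statements will give you the random values; you can then put them together with your data using cbind.data.frame and merge.

Example data: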

set.seed(1)
df <- data.frame(a = factor(c(1,2,1,2,1), 1:2, labels = c("yes", "no")),
                 b = 1:5,
                 c = rnorm(5))

    a b          c
1 yes 1 -0.6264538
2  no 2  0.1836433
3 yes 3 -0.8356286
4  no 4  1.5952808
5 yes 5  0.3295078

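The function checks the data type and returns n randomly generated values based on the variable's distribution:

FUN1 <- function(x, n = 1, seed = 1){
  # Note: the seed is reset on every call, so columns generated in separate
  # calls with the same seed are driven by the same underlying random draws.
  set.seed(seed)
  if(is.character(x)){
    # character: sample the distinct values in proportion to their frequency
    y <- sample(sort(unique(x)), n, replace = TRUE, prob = table(x))
  }
  if(is.factor(x)){
    # factor: sample the levels in proportion to their frequency
    y <- sample(levels(x), n, replace = TRUE, prob = table(x))
  }
  if(is.integer(x)){
    # integer: normal draws with the column's mean and sd, rounded
    y <- round(rnorm(n, mean(x), sd(x)))
  }
  if(!is.integer(x) && is.numeric(x)){
    # other numeric: normal draws with the column's mean and sd
    y <- rnorm(n, mean(x), sd(x))
  }
  return(y)
}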

Loop it over the empirical data using lapply:

newvalues <- lapply(df, FUN1, n = 10)

$a
 [1] "yes" "yes" "yes" "no"  "yes" "no"  "no"  "no"  "no"  "yes"

$b
 [1] 2 3 2 6 4 2 4 4 4 3

$c
 [1] -0.4727769  0.3057584 -0.6738021  1.6623976  0.4459399 -0.6592326  0.5977084  0.8388290  0.6826185 -0.1642204
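
Note that FUN1 resets the seed on each call, so under lapply every column starts from the same random state; that is why b and c above rise and fall together. Below is a minimal sketch of a variant that seeds once, outside the function, so that the columns are drawn independently. The name gen_column and the error for unsupported types are illustrative assumptions, not part of the original answer:

```r
# Hypothetical variant of FUN1: no seeding inside, one branch per column type.
gen_column <- function(x, n) {
  if (is.character(x) || is.factor(x)) {
    counts <- table(x)  # frequency of each value/level
    sample(names(counts), n, replace = TRUE, prob = as.numeric(counts))
  } else if (is.integer(x)) {
    as.integer(round(rnorm(n, mean(x), sd(x))))
  } else if (is.numeric(x)) {
    rnorm(n, mean(x), sd(x))
  } else {
    stop("unsupported column type: ", class(x)[1])
  }
}

# Same example data as above
set.seed(1)
df <- data.frame(a = factor(c(1,2,1,2,1), 1:2, labels = c("yes", "no")),
                 b = 1:5,
                 c = rnorm(5))

set.seed(42)  # seed once for the whole simulation, not once per column
newvalues <- lapply(df, gen_column, n = 1000)
```

Reproducibility is kept because the single set.seed call still fixes the whole sequence of draws.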

Now cbind.data.frame them with do.call:

df1 <- do.call("cbind.data.frame", newvalues)

> df1
     a b          c
1  yes 2 -0.4727769
2  yes 3  0.3057584
3  yes 2 -0.6738021
4   no 6  1.6623976
5  yes 4  0.4459399
6   no 2 -0.6592326
7   no 4  0.5977084
8   no 4  0.8388290
9   no 4  0.6826185
10 yes 3 -0.1642204

And merge them:

df2 <- merge(df, df1, all = TRUE)

     a b          c
1  yes 1 -0.6264538
2  yes 2 -0.6738021
3  yes 2 -0.4727769
4  yes 3 -0.8356286
5  yes 3 -0.1642204
6  yes 3  0.3057584
7  yes 4  0.4459399
8  yes 5  0.3295078
9   no 2 -0.6592326
10  no 2  0.1836433
11  no 4  0.5977084
12  no 4  0.6826185
13  no 4  0.8388290
14  no 4  1.5952808
15  no 6  1.6623976

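Apart from the merge, the process is fairly fast. For very large data, the merge may take some time. In a quick test with three variables and 10 million new rows, generating and cbind-ing took less than a second, while the merge took about a minute. Given that the bulk of the data is randomly generated anyway, you could use just the generated dataset and skip the merge altogether.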
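
Since the goal is only to append rows, base R's rbind may be a simpler alternative to merge here: it stacks two data frames by column name without a join step. This is a minimal sketch, assuming the generated columns are first given the same types as the originals. The small df1 below is a stand-in for the generated data, and no timing is claimed:

```r
set.seed(1)
df <- data.frame(a = factor(c(1,2,1,2,1), 1:2, labels = c("yes", "no")),
                 b = 1:5,
                 c = rnorm(5))

# Stand-in for the generated rows, with column types matched to df
df1 <- data.frame(a = factor(sample(levels(df$a), 10, replace = TRUE),
                             levels = levels(df$a)),
                  b = as.integer(round(rnorm(10, mean(df$b), sd(df$b)))),
                  c = rnorm(10, mean(df$c), sd(df$c)))

# Append the generated rows below the original ones
df2 <- rbind(df, df1)
```

Unlike merge, rbind does not re-sort the rows: the original rows stay first, followed by the generated ones.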

Answer 2

A quick and easy way to preserve the proportions is to bootstrap (sample with replacement) from the column vectors/features.

# lapply keeps each column's type; apply() would first coerce a mixed-type df to a character matrix
new_df <- as.data.frame(lapply(df, function(x) sample(x, 1e6, replace = TRUE)))
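
Sampling each column separately preserves the marginal proportions but breaks any relationship between columns. If those relationships matter, one option is to resample whole rows instead. Below is a minimal sketch with a toy df standing in for the real data (the variable names are illustrative):

```r
set.seed(1)
df <- data.frame(a = factor(c("yes", "no", "yes", "no", "yes")),
                 b = 1:5,
                 c = rnorm(5))

n_new <- 1000  # use 1e6 for the full-size test set

# Draw row indices with replacement, then take those rows as a block
idx <- sample(nrow(df), n_new, replace = TRUE)
new_df <- df[idx, ]
rownames(new_df) <- NULL  # replace the duplicated-index row names with 1..n_new
```

The trade-off is that every new row is an exact copy of an existing row, so no new value combinations appear.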

If you want to simulate from the empirical distribution of the numeric features, you may need to write a custom function.
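
One possible shape for such a function, offered as a sketch rather than as the answerer's method: draw uniform probabilities and pass them through the sample quantile function, which amounts to inverse-transform sampling from an interpolated empirical distribution. The name sim_empirical is hypothetical, and quantile's default interpolation (type 7) is assumed to be acceptable:

```r
# Hypothetical helper: simulate n values from the empirical distribution of x
sim_empirical <- function(x, n) {
  # quantile() interpolates between order statistics, so simulated values are
  # not restricted to the observed ones but stay within the observed range
  quantile(x, probs = runif(n), names = FALSE)
}

set.seed(1)
x <- rexp(200)                 # skewed toy data standing in for a numeric column
sim <- sim_empirical(x, 1000)
```

Compared with rnorm(n, mean(x), sd(x)), this keeps the skew and bounds of the observed data, at the cost of never producing values outside the observed range.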

How do I create random additional rows and append them to an existing data frame?
