Question

A common way to sample or split data in R is to use sample on row numbers. For example:

library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE)  # row positions; N-1 keeps them valid after one row is dropped
test1 <- sample1[test, .(id)]

The problem is that this is not robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[-sample(N, 1)]

Samples 1 and 2 are still nearly identical:

nrow(merge(sample1, sample2))

[1] 9999

Yet the same row split yields very different test sets, even though we have set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653
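
The overlap collapses because test stores row positions, not ids: once a row is dropped, every position at or after it points to the next id. A toy sketch with made-up values (the names ids, idx, and ids_dropped are hypothetical and not part of the example above):

```r
ids <- letters[1:10]          # toy ids "a".."j"
idx <- c(1, 2, 5, 8)          # fixed row positions, standing in for sample()
ids_dropped <- ids[-3]        # drop the third id, "c"

picked_before <- ids[idx]          # "a" "b" "e" "h"
picked_after  <- ids_dropped[idx]  # "a" "b" "f" "i"

# Only positions before the dropped row still select the same id
picked_before == picked_after      # TRUE TRUE FALSE FALSE
```

The same shift happens at scale in sample2, which is why only part of test1 survives in test2.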

One could sample on specific IDs instead, but that would not be robust if observations are omitted or added.

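How can the split be made more robust to changes in the data? That is, the assignment to test should stay the same for unchanged observations, dropped observations should simply not be assigned, and new observations should be assigned.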

Answer 1

Use a hash function and sample on the modulus of its first hex digit:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
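
One caveat about m: a single hex digit takes only 16 values, so the buckets are equally sized only when m divides 16 (2, 4, 8, or 16). For other proportions such as 80:20, one option is to map more hash digits to a number in [0, 1) and threshold it. A minimal sketch, assuming the same openssl::md5 hashing; md5_unif and test_share are hypothetical names, not part of the function above:

```r
md5_unif <- function(x) {
  # First 7 hex digits of the md5 hash, read as an integer in [0, 16^7)
  # (7 digits keeps the value below R's integer limit for strtoi)
  h <- substr(as.character(openssl::md5(x)), 1, 7)
  strtoi(h, base = 16L) / 16^7
}

ids <- as.character(100000:100999)     # made-up ids
test_share <- 0.2                      # assumed target share for the test set
is_test <- md5_unif(ids) < test_share  # roughly 20% of ids, fixed per id
```

As with the modulus version, the result depends only on each id, so reordering or dropping rows does not change the assignment of the remaining ids.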

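Hash-based splitting works better here because the test/train assignment is determined by each observation's hash rather than by its relative position in the data: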

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

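The sample size is not exactly 5000 because the assignment is probabilistic, but thanks to the law of large numbers this is not a problem in large samples.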
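
To gauge how far from 5000 the test set can drift, here is a rough sketch that treats each id's assignment as an independent fair coin flip (an assumption about the hash, not something demonstrated above):

```r
n <- 1e4                             # ids in sample1
p <- 0.5                             # assumed probability of landing in test
expected <- n * p                    # 5000
spread <- sqrt(n * p * (1 - p))      # binomial standard deviation: 50
z <- (5057 - expected) / spread      # observed test size in sd units: 1.14
```

Under that assumption the observed 5057 is about 1.1 standard deviations above the mean, and the spread relative to the sample size shrinks as 1/sqrt(n).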

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
