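I want to generate numeric strings with a large number of digits, in this case ID values for a synthetic dataset.
For short numeric strings, I would use sample: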
sprintf("%05.f", sample(0:(1e5-1), 18))
## [1] "54783" "80354" "53607" "99668" "63621" "07121" "15944" "27436" "96837"
## [10] "28751" "95315" "63326" "00981" "15300" "18448" "09885" "63360" "04539"
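This does not work for longer strings. First the memory requirement becomes too large, and then the numbers themselves cannot be made big enough. For example, this fails: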
sprintf("%020.f", sample(0:(1e20-1), 18))
## Error in 0:(1e+20 - 1) : result would be too long a vector
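The second limitation is a property of R's numeric types and can be seen without allocating any large vector. A minimal base R sketch, only meant to illustrate why a 20-digit number cannot be drawn as a single numeric value:

```r
# R integers are 32-bit, so this is the largest integer value available.
.Machine$integer.max

# Doubles have a 53-bit mantissa: above 2^53 (a 16-digit number),
# neighbouring whole numbers can no longer be told apart.
same <- 2^53 == 2^53 + 1
same                               # TRUE
formatted <- sprintf("%.0f", 2^53 + 1)
formatted                          # "9007199254740992", the +1 is lost
```

A 20-digit double therefore cannot carry 20 independent random digits, which is why the approaches below build the string digit by digit.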
How can I generate numeric strings with a large number of digits?
Answer 0 (score: 7)
You can use the stringi package:
require(stringi)
stri_rand_strings(10,50,pattern="[0-9]")
#[1] "33163217620361477538822791082750025522246331345665"
#[2] "85105858270154002408385176647161448078668054193081"
#[3] "62417899981033664011261714060242781925235001978704"
#[4] "17731152361720663463691231461493607438220463345863"
#[5] "06316044683426574113640145569673845269595104465896"
#[6] "17058300286927387520323781399768150137786864069558"
#[7] "86204984977415277470013113957915963393339586096213"
#[8] "56382530391794208466245591896055134584746907393458"
#[9] "61740570216902905237145952608961548203505061535222"
#[10] "28713530448562268345804947527043822080897315821103"
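The first argument is the length of the result vector (the number of strings), the second is the number of characters in each string, and the third is a character class pattern, here restricting the output to digits.
Sticking with base R, you can generate 1000 strings of 50 digits each like this: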
apply(matrix(sample(charToRaw("0123456789"),50*1000,replace=TRUE),nrow=1000),1,rawToChar)
Answer 1 (score: 6)
A base R alternative:
set.seed(123)
paste0(sample(0:9,50,replace=TRUE),collapse="")
#[1] "27489058549465182039866967552199670472321443112428"
Edit: as @docendodiscimus suggested, this can be combined with replicate() to get any number of such strings:
replicate(10,paste0(sample(0:9,50,replace=TRUE),collapse=""))
# [1] "27489058549465182039866967552199670472321443112428" "04715217836032848874767042363126471498811636317045"
# [3] "53494896419309715954633239101668675687943401822027" "84321352425363357242618766358583725425992396944615"
# [5] "29654832114226073489297603456964502318185616373997" "22525714489869553305800177940671320302062108789107"
# [7] "70776410443470388238821710903962783466694152439326" "19516964381183371044438459723957375912029277122119"
# [9] "91953470363824219340565386331895392614012571877136" "53202887119441522628084764602728369116489047092067"
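Since the strings in the question are meant to be ID values, note that drawing digits independently does not by itself guarantee distinct strings, whereas sample() without replacement in the short case does. A minimal sketch, assuming the IDs must be unique; unique_digit_ids is a hypothetical helper name, not something from the answers here:

```r
# Hypothetical helper: draw n distinct digit strings of n_digits characters.
unique_digit_ids <- function(n, n_digits) {
  stopifnot(n <= 10^n_digits)  # cannot ask for more IDs than exist
  ids <- character(0)
  while (length(ids) < n) {
    needed <- n - length(ids)
    new_ids <- replicate(
      needed,
      paste0(sample(0:9, n_digits, replace = TRUE), collapse = "")
    )
    # Drop any collisions and top up on the next pass.
    ids <- unique(c(ids, new_ids))
  }
  ids
}

set.seed(123)
ids <- unique_digit_ids(10, 50)
anyDuplicated(ids)  # 0 means no duplicates
```

With 50 digits a collision is extremely unlikely, so the loop normally runs once; the check matters more for short IDs or very large n.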
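Answer 2 (score: 3)
Obligatory benchmark:
library(magrittr)        # for %>%
library(stringi)         # for stri_rand_strings()
library(microbenchmark)
# split/vapply approach
GNS <- function(nNumbers, nCharsPerNumber)
{
  sample(0:9, nNumbers * nCharsPerNumber, replace = TRUE) %>%
    split(gl(nNumbers, nCharsPerNumber)) %>%
    vapply(paste0, character(1), collapse = "", USE.NAMES = FALSE)
}
# replicate/paste0 approach
GNP <- function(nNumbers, nCharsPerNumber) {
  replicate(nNumbers, paste0(sample(0:9, nCharsPerNumber, replace = TRUE), collapse = ""))
}
# stringi approach
GST <- function(nNumbers, nCharsPerNumber) {
  stri_rand_strings(nNumbers, nCharsPerNumber, pattern = "[0-9]")
}
# times = 10 must be named; a bare 10 would be benchmarked as a fourth expression
microbenchmark(GNS(1000, 100), GNP(1000, 100), GST(1000, 100), times = 10)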
The results:
Unit: milliseconds
           expr       min        lq     mean    median        uq       max neval
 GNS(1000, 100) 36.832684 38.918858 40.90260 40.750332 41.374165 46.369622    10
 GNP(1000, 100) 36.808395 39.310571 39.99557 40.094511 40.772055 44.025157    10
 GST(1000, 100)  1.882961  1.923672  2.03537  1.983199  2.166911  2.325648    10
We have a clear winner!
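Edit: adding another base R option, which is even faster.
GSAP <- function(nNumbers, nCharsPerNumber) {
  # nrow = nNumbers so that each row is one string; as first posted this used
  # nrow = nCharsPerNumber, which returns nCharsPerNumber strings of nNumbers digits
  digits <- sample(charToRaw("0123456789"), nNumbers * nCharsPerNumber, replace = TRUE)
  apply(matrix(digits, nrow = nNumbers), 1, rawToChar)
}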
Unit: microseconds
expr min lq mean median uq max
GSAP(1000, 100) 724.584 739.637 821.435 766.8345 899.06 1030.086
GNS(1000, 100) 36189.180 38316.406 39739.471 39141.5695 39965.02 44478.450
GNP(1000, 100) 35777.282 36331.839 38448.665 38575.8945 39725.21 43016.281
GST(1000, 100) 1863.803 1898.013 1944.472 1918.7110 1975.33 2122.094
Second edit: trying a larger input, and getting the code right this time (times in seconds):
expr min lq mean median uq max neval
GSAP(x, y) 3.906626 3.975160 4.069103 4.049784 4.163262 4.329284 10
GNS(x, y) 33.645200 33.972587 34.513555 34.406009 35.141313 35.328662 10
GNP(x, y) 30.833180 31.136971 33.037422 32.193070 33.010896 41.713811 10
GST(x, y) 1.697303 1.706599 1.731205 1.735127 1.756961 1.763861 10
So GST wins, with a median time a bit under half of GSAP's.
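Because the benchmark had to be repeated with corrected code, it may help to check what each candidate returns before timing it. A minimal base R sketch; check_shape, by_replicate, raw_by_row and raw_swapped are hypothetical names introduced here for illustration, and the two raw variants differ only in the nrow argument:

```r
# Hypothetical checker: TRUE if x holds n strings of exactly n_chars digits.
check_shape <- function(x, n, n_chars) {
  is.character(x) && length(x) == n &&
    all(nchar(x) == n_chars) && !any(grepl("[^0-9]", x))
}

by_replicate <- function(n, n_chars) {
  replicate(n, paste0(sample(0:9, n_chars, replace = TRUE), collapse = ""))
}

# One string per row, so the matrix needs n rows.
raw_by_row <- function(n, n_chars) {
  m <- matrix(sample(charToRaw("0123456789"), n * n_chars, replace = TRUE),
              nrow = n)
  apply(m, 1, rawToChar)
}

# Same call with nrow = n_chars: n_chars strings of n digits instead.
raw_swapped <- function(n, n_chars) {
  m <- matrix(sample(charToRaw("0123456789"), n * n_chars, replace = TRUE),
              nrow = n_chars)
  apply(m, 1, rawToChar)
}

check_shape(by_replicate(1000, 100), 1000, 100)  # TRUE
check_shape(raw_by_row(1000, 100), 1000, 100)    # TRUE
check_shape(raw_swapped(1000, 100), 1000, 100)   # FALSE
```

The swapped variant does less work per call (fewer, longer rows), so timing it against the others is not a like-for-like comparison.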
Answer 3 (score: 2)
Generate the individual digits, split them among the numbers, then collapse the digits of each number into one string.
library(magrittr)
generateNumberStrings <- function(nNumbers, nCharsPerNumber)
{
sample(0:9, nNumbers * nCharsPerNumber, replace = TRUE) %>%
split(gl(nNumbers, nCharsPerNumber)) %>%
vapply(paste0, character(1), collapse = "", USE.NAMES = FALSE)
}
generateNumberStrings(18, 20)
## [1] "06985095513359117867" "95278964413245221928" "75398392571928201881"
## [4] "00722065797044523279" "24475619649735183646" "29165493966488037145"
## [7] "34289922968745727406" "82354362380114534171" "84293845597888728670"
## [10] "97570546918892201649" "41421884356741221760" "99306177663904189401"
## [13] "25668966612346726451" "94949806854834288664" "43664073601604613019"
## [16] "25848242347176214032" "80736828777283687373" "83763855757083999312"
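To see how the split step assigns digits to strings, here is a minimal sketch on a tiny, fixed input, written without the pipe so that it needs only base R. The digits vector is a hand-picked stand-in for the output of sample():

```r
digits <- c(4, 0, 7, 1, 9, 9)    # stand-in for sample(0:9, 6, replace = TRUE)
groups <- gl(3, 2)               # factor with levels 1 1 2 2 3 3
pieces <- split(digits, groups)  # list of 3 vectors, 2 digits each
result <- vapply(pieces, paste0, character(1), collapse = "", USE.NAMES = FALSE)
result
# [1] "40" "71" "99"
```

gl(nNumbers, nCharsPerNumber) labels the first nCharsPerNumber digits as group 1, the next as group 2, and so on, so each group becomes one output string.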