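I'm trying to generate a random set of demographic data in R. I want it generated row by row rather than column by column, so that one function can build part of a row from the result of a previous function. I know this can be done with a for loop (as I've done below), but loops in R are very slow. I've read that you can use apply or while to loop more efficiently, but despite many failed attempts I haven't figured out how. Below is a working example that uses the loop. How can I do this with apply or while?
y <- 1980 ## MedianYr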
d <- 0.1 ## Rate of NA responses
AgeFn <- function(y){
  Year <- 1900 + as.POSIXlt(Sys.Date())$year
  RNormYr <- as.integer((rnorm(1)*10+y))
  Age <- Year - RNormYr
}
EduByAge <- function (Age, d) {
  ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)),
  ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)),
  ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)),
  ifelse(Age > 20 & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)),
  ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA)))))
}
GenderFn <- function(d){
  Gender1 <- sample(c("Male","Female","Trans", NA), 1, replace=TRUE, prob=c(0.49, 0.5, 0.01, d))
  return(Gender1)
}
UserGen <- function(n,s) {
  set.seed(s)
  Rows <- function(y,d){
    Age <- abs(AgeFn(y))
    Gender <- GenderFn(d)
    Education <- EduByAge(Age,d)
    c(i, Age, Gender, Education)
  }
  df <- data.frame(matrix(NA, ncol = 4, nrow = n))
  for(i in (1:n)) {
    df[i,] <- Rows(y,d)
  }
  colnames(df) <- c("ID", "Age", "Gender", "Education")
  return(df)
}
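Answer 0 (score: 1)
So, the way you've written your code means you're going to end up with a loop one way or another.
The apply family is for applying a function to each element of another structure. So it will work when you want to pass a vector containing all the ages to another function. However, it's not so hot for running the AgeFn() function you have, because that function doesn't take as an argument anything you'd want to iterate over.
Here's another possibility, which drops your method of getting random ages in favor of the sample function. I've made some assumptions, but I hope the explanations help you figure out how all of this works in R.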
y <- 1980 ## MedianYr
d <- 0.1 ## Rate of NA responses
agemin <- 14
agemax <- 90
# The stats guy in me thinks that you might have some
# methodological problems here with how the ages are assigned
# But I'm just going to stick with it for now
EduByAge <- function (Age, d) {
  ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)),
  ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)),
  ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)),
  ifelse(Age > 20 & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)),
  ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA)))))
}
NewUserGen <- function(n,s) {
  set.seed(s)
  ## Start by creating a data frame with IDs
  fakedata <- data.frame(ID=1:n)
  # Rather than a function, here I just used the built-in sample function
  # I am sampling for n ages lying between agemin and agemax
  # Using dnorm(), I assume a normal distribution of the ages, with
  # mean age equal to today's year minus the "MedianYr" you were using above
  # I assume that the mean and the SD are equal, you don't have to do that
  # I put in a few extra line breaks here to make things not quite so
  # tight together - figured it would be easier to read.
  fakedata$Age <- sample(x=agemin:agemax,size=n,replace=TRUE,
                         prob=
                           dnorm(agemin:agemax,
                                 mean=abs(y-as.numeric(format(Sys.Date(),"%Y"))),
                                 sd=abs(y-as.numeric(format(Sys.Date(),"%Y")))))
  # I'm sure you know this, but you have some issues here
  # namely that you have a probability vector that totals to more than 1.
  # sample() rescales the weights, so NA turns up with probability
  # d/(1+d) rather than d.
  fakedata$Gender <- sample(c("Male","Female","Trans", NA),
                            n, replace=TRUE, prob=c(0.49, 0.5, 0.01, d))
  # Here is the actual sapply() call; d is passed through rather than hard-coded
  fakedata$Edu <- sapply(fakedata$Age,FUN=EduByAge,d=d)
  return(fakedata)
}
outdata <- NewUserGen(300,10201)
Here's how the data look in aggregate:
# "Professional" must be listed as a level, otherwise factor() turns it into NA
outdata$Edu <- factor(outdata$Edu,levels=c("Some High School",
                                           "High School Grad",
                                           "Associates",
                                           "Bachelors",
                                           "Masters",
                                           "Professional",
                                           "Doctorate"),ordered=TRUE)
hist(outdata$Age)
barplot(table((outdata$Gender)))
par(mai=c(3,1,1,1))
barplot(table((outdata$Edu)),las=2)
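The point about the probability vector totalling more than 1 can be checked directly. The snippet below is only a sketch: it reuses d and the gender weights from the code above, while the names w, draws and w_fixed are made up for illustration. It shows what sample() does with weights that sum to 1.1, plus one possible way to rescale them so that NA really occurs at rate d.

```r
set.seed(1)
d <- 0.1                              # NA rate, as in the question
w <- c(0.49, 0.5, 0.01, d)            # gender weights; they sum to 1.1

# sample() treats prob as weights and rescales them internally
round(w / sum(w), 3)

# Empirical check: the NA share lands near d / (1 + d), not d
draws <- sample(c("Male", "Female", "Trans", NA), 1e5,
                replace = TRUE, prob = w)
mean(is.na(draws))

# Hypothetical fix: shrink the non-missing weights so that NA gets exactly d
w_fixed <- c(c(0.49, 0.5, 0.01) * (1 - d), d)
sum(w_fixed)
```

Whether the rescaled weights are what you want depends on whether the original numbers were meant as shares of all respondents or only of those who answered.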
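Answer 1 (score: 0)
I would modify the Rows function to take the ID as an argument, rather than relying on the "i" picked up from the enclosing scope.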
Rows <- function(i, y,d){
  Age <- abs(AgeFn(y))
  Gender <- GenderFn(d)
  Education <- EduByAge(Age,d)
  c(i, Age, Gender, Education)
}
Then you can call your function with lapply:
res1 = lapply(1:3000, function(i){
Rows(i, y, d)
})
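This alone doesn't really improve speed, but if you're on a machine with multiple cores, you can get some benefit from mclapply. It originally came from the "multicore" library and now ships in R's built-in parallel package (it relies on forking, so it does not run in parallel on Windows).
library("parallel")  # mclapply now lives here; "multicore" is no longer on CRAN
res2 = mclapply(1:3000, function(i){
  Rows(i, y,d)
})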
Oh, and if you want the result as a data frame, you can do this:
df = data.frame(do.call(rbind, res1))
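One thing to watch with this approach: c(i, Age, Gender, Education) mixes numbers and strings, so every value is coerced to character and Age ends up as text in the data frame. Below is a rough sketch of one way around that. make_row is a hypothetical stand-in for Rows (its age and gender draws are illustrative only, not the question's distributions), and it returns a one-row data frame so that each column keeps its own type.

```r
set.seed(42)

# Hypothetical stand-in for Rows(): one-row data frame instead of c(...)
make_row <- function(i) {
  age <- abs(as.integer(rnorm(1, mean = 40, sd = 10)))  # illustrative values only
  gender <- sample(c("Male", "Female"), 1)
  data.frame(ID = i, Age = age, Gender = gender, stringsAsFactors = FALSE)
}

rows <- lapply(1:5, make_row)

# Vector route: unlist() coerces each row to character, as c() does
as_text <- data.frame(do.call(rbind, lapply(rows, unlist)),
                      stringsAsFactors = FALSE)
class(as_text$Age)

# Data frame route: rbind() on data frames keeps the column types
typed <- do.call(rbind, rows)
class(typed$Age)
str(typed)
```

Building thousands of one-row data frames is slow in its own right, so this is about correctness of the column types rather than speed.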
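Answer 2 (score: 0)
For the main function, you can use one of the functions from the apply family, namely replicate. The speed-up comes from the fact that R copies objects when they are modified, so the for loop needlessly copies the data frame on each assignment: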
UserGen2 <- function(n,s) {
  set.seed(s)
  Rows <- function(y,d) {
    Age <- abs(AgeFn(y))
    Gender <- GenderFn(d)
    Education <- EduByAge(Age,d)
    c(Age, Gender, Education)
  }
  samp <- t(replicate(n,Rows(y,d)))
  colnames(samp) <- c("Age","Gender","Education")
  data.frame(ID=seq_len(dim(samp)[1]),samp)
}
There are other improvements you could make.
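As one possible example of such an improvement, the education step can be drawn for a whole vector of ages at once, looping over the five age brackets instead of over rows. This is only a sketch: sample_edu and edu_levels are hypothetical names, the brackets and weights are copied from EduByAge, and it assumes whole-number, non-missing ages with 21 and 22 treated as their own bracket. Because the random numbers are consumed in a different order, the output will not match the original loop for the same seed.

```r
edu_levels <- c("Some High School", "High School Grad", "Associates",
                "Bachelors", "Masters", "Professional", "Doctorate")

# Hypothetical helper: vectorised replacement for calling EduByAge() per row
sample_edu <- function(age, d) {
  # Weights per bracket, in the order of edu_levels (NA weight d added below)
  weights <- list(
    c(1 - d),
    c(0.085, 0.604),
    c(0.085, 0.25, 0.354),
    c(0.085, 0.25, 0.075, 0.279),
    c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012)
  )
  # Brackets: 1 = under 17, 2 = 17-18, 3 = 19-20, 4 = 21-22, 5 = 23 and over
  bracket <- findInterval(age, c(17, 19, 21, 23)) + 1
  out <- rep(NA_character_, length(age))
  for (b in sort(unique(bracket))) {
    idx <- which(bracket == b)
    w <- weights[[b]]
    choices <- c(edu_levels[seq_along(w)], NA)
    # One sample() call per bracket, drawing for all people in it at once
    out[idx] <- sample(choices, length(idx), replace = TRUE, prob = c(w, d))
  }
  out
}

set.seed(10201)
ages <- sample(14:90, 1000, replace = TRUE)
edu <- sample_edu(ages, d = 0.1)
table(edu[ages < 17], useNA = "ifany")
```

Age and Gender can be generated the same way with a single rnorm(n) or sample(..., n) call each, which removes the per-row function calls altogether.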