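I have a data.table, dt, made up of students' anonymised end-of-year exam grades. Each row is a student and each column is a subject.
If a student took a subject, the cell holds their grade; if they did not, it is NA. I want to find the total number of subjects each student took.
dt already comes with this count (num_subjects), but it was not coded to account for subjects that may be studied at "double" intensity. Double intensity is indicated by 2 grades being recorded instead of 1.
So I need code that looks for double grades such as AA, AB or CC and, where one is found, adds 1 to the number of subjects taken.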
Example data (note that there are actually 30+ potential subjects):
subject1 subject2 subject3 num_subjects
AA NA NA 1
B BB C 3
NA NA A 1
NA CC D 2
Desired output:
subject1 subject2 subject3 num_subjects new_num_subjects
AA NA NA 1 2
B BB C 3 4
NA NA A 1 1
NA CC D 2 3
To do this for a single subject, I would just do:
dt[, new_num_subjects := num_subjects]
dt[
(subject1 == "AA" |
subject1 == "AB" |
subject1 == "BB" |
subject1 == "BC" |
subject1 == "CC"
), new_num_subjects := num_subjects + 1]
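This works. But how do I avoid writing it out for all 30+ subjects?
I tried making a vector of all the subject names (subjectvec) and looping over it (see below), but this does not work: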
for (i in subjectvec)
dt[
# (i) is the column name as a string here, not the column's values
( (i) == "AA" |
(i) == "AB" |
(i) == "BB" |
(i) == "BC" |
(i) == "CC"
), new_num_subjects := num_subjects + 1]
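The loop above compares the string held in i (for example "subject1") with "AA", so no row ever matches. A minimal sketch of one way to make such a loop work is given below, assuming subjectvec holds the column names as character strings. The names double_grades and subj are introduced here for illustration only. Note that the increment has to build on new_num_subjects rather than num_subjects, otherwise each pass would overwrite the previous one.

```r
library(data.table)

# Toy data copied from the question
dt <- data.table(
  subject1 = c("AA", "B", NA, NA),
  subject2 = c(NA, "BB", NA, "CC"),
  subject3 = c(NA, "C", "A", "D"),
  num_subjects = c(1, 3, 1, 2)
)

subjectvec <- c("subject1", "subject2", "subject3")  # assumed: column names as strings
double_grades <- c("AA", "AB", "BB", "BC", "CC")     # double grades listed in the question

dt[, new_num_subjects := num_subjects]
for (subj in subjectvec) {
  # get(subj) fetches the column named by the string; %in% treats NA as "no match"
  dt[get(subj) %in% double_grades, new_num_subjects := new_num_subjects + 1]
}
```

With the example data this should reproduce the desired new_num_subjects column.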
Answer 0 (score: 0)
Here you go. You can use a column index such as [, c(1:3, 5)] to define which columns are used in the calculation.
subject1 <- c("AA", "B", NA, NA)
subject2 <- c(NA, "BB", NA, "CC")
subject3 <- c(NA, "C", "A", "D")
num_subjects <- c(1, 3, 1, 2)
grades <- data.frame(subject1, subject2, subject3, num_subjects, stringsAsFactors = FALSE)
# For each row, add up the number of characters in the subject columns (1:3);
# NA grades are dropped by na.rm = TRUE
grades$new_num_subjects <-
sapply(seq_len(nrow(grades)), function(x)
sum(nchar(unlist(grades[x, 1:3])), na.rm = TRUE)
)
Look at the output:
grades
subject1 subject2 subject3 num_subjects new_num_subjects
1 AA <NA> <NA> 1 2
2 B BB C 3 4
3 <NA> <NA> A 1 1
4 <NA> CC D 2 3
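Based on Roman Luštrik's example, try this and see whether it is faster. lapply takes a list, and I am not sure what data types your variables have. We have answered the question based on the example provided; if you provide more data, it will be easier to help.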
# Use only the subject columns (1:3) so the numeric count columns are not included
grades$new_num_subjects <-
apply(grades[, 1:3], MARGIN = 1, FUN = function(x) sum(nchar(x), na.rm = TRUE))
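Both versions above work row by row. As a hedged alternative, the same count can be sketched column by column, which avoids calling a function once per row. The name subject_cols is introduced here for illustration, and no timings are claimed.

```r
grades <- data.frame(
  subject1 = c("AA", "B", NA, NA),
  subject2 = c(NA, "BB", NA, "CC"),
  subject3 = c(NA, "C", "A", "D"),
  num_subjects = c(1, 3, 1, 2),
  stringsAsFactors = FALSE
)

subject_cols <- c("subject1", "subject2", "subject3")  # illustrative name

# nchar() on each column gives the number of grade letters (NA stays NA),
# producing a rows-by-subjects matrix
grade_lengths <- sapply(grades[subject_cols], nchar)

# Sum across subjects for each student, ignoring subjects not taken
grades$new_num_subjects <- rowSums(grade_lengths, na.rm = TRUE)
```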
Answer 1 (score: 0)
I ran this on a very large simulated data frame (1M rows) and it ran very quickly (although it is not a data.table approach):
# Dummy data
subject <- c("A", "AA", "B", "BB", "C", "CC", NA)
n <- 1000000
grades <- data.frame(
"s1" = sample(subject, n, replace = TRUE),
"s2" = sample(subject, n, replace = TRUE),
"s3" = sample(subject, n, replace = TRUE)
)
# Count subjects
grades$new <- nchar(gsub("NA", "", paste0(grades[,1], grades[,2], grades[,3])))
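To make the one-liner easier to follow, here is a small worked sketch of its three steps on the four example rows from the question. The intermediate names pasted, cleaned and counts are illustrative. It relies on the assumption that no real grade contains the letters "NA", since those are stripped along with the missing values.

```r
s1 <- c("AA", "B", NA, NA)
s2 <- c(NA, "BB", NA, "CC")
s3 <- c(NA, "C", "A", "D")

# Step 1: paste0() turns each missing value into the literal text "NA"
pasted <- paste0(s1, s2, s3)       # "AANANA" "BBBC" "NANAA" "NACCD"

# Step 2: remove those literal "NA" pieces
cleaned <- gsub("NA", "", pasted)  # "AA" "BBBC" "A" "CCD"

# Step 3: one character per grade, so the string length is the subject count
counts <- nchar(cleaned)           # 2 4 1 3
```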
Answer 2 (score: 0)
Using Lee S's data, here are 2 data.table-based options:
library(data.table)
subject1 <- c("AA", "B", NA, NA)
subject2 <- c(NA, "BB", NA, "CC")
subject3 <- c(NA, "C", "A", "D")
num_subjects <- c(1, 3, 1, 2)
grades <- data.table(subject1, subject2, subject3, num_subjects)
## first option
# 1 paste subject1:subject3 together (but NA becomes character "NA" in paste'ing) with do.call
# 2 change all of those "NA"s to "" with gsub
# 3 find nchar
grades[,num_new := nchar(gsub("NA","",do.call(paste0,.SD))), .SDcols=subject1:subject3]
grades
# subject1 subject2 subject3 num_subjects num_new
#1: AA NA NA 1 2
#2: B BB C 3 4
#3: NA NA A 1 1
#4: NA CC D 2 3
## Second option
rowSums_na.rm <- function(x) rowSums(x, na.rm = TRUE)
grades[,num_new := rowSums_na.rm(as.data.table(lapply(.SD, nchar))),
.SDcols=subject1:subject3]
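With 30+ subjects, the columns may not sit next to each other, so a range such as subject1:subject3 may not apply. A minimal sketch of selecting them by name instead follows, under the assumption that every grade column's name starts with "subject". Both that naming pattern and the name subject_cols are hypothetical and should be adapted to the real data.

```r
library(data.table)

grades <- data.table(
  subject1 = c("AA", "B", NA, NA),
  subject2 = c(NA, "BB", NA, "CC"),
  subject3 = c(NA, "C", "A", "D"),
  num_subjects = c(1, 3, 1, 2)
)

# Assumption: grade columns are exactly those whose names start with "subject"
subject_cols <- grep("^subject", names(grades), value = TRUE)

# Count grade letters per column, then sum across columns for each student
grades[, num_new := rowSums(as.data.frame(lapply(.SD, nchar)), na.rm = TRUE),
       .SDcols = subject_cols]
```

Passing a character vector to .SDcols keeps the call unchanged no matter how many subject columns there are.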