Apply a command in every instance where any one of the columns takes one of a set range of values

Time: 2018-04-19 19:47:14

Tags: r data.table

I have a data.table consisting of students' anonymised final exam grades. Each row is a student and each column is a subject.

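If a student took a subject, the column holds their grade; if they did not, it holds NA. I want to find the total number of subjects each student took.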

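The dt already comes with this count (num_subjects), but it was not coded to account for the fact that some subjects can be studied at "double" intensity. Double intensity means that 2 grades are recorded instead of 1.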

So I need to write code that finds instances of double grades such as AA, AB or CC and, where one is found, adds 1 to the number of subjects taken.

Example data (note that in reality there are 30+ potential subjects):

subject1  subject2  subject3  num_subjects
AA        NA        NA        1
B         BB        C         3
NA        NA        A         1
NA        CC        D         2

Desired output:

subject1  subject2  subject3  num_subjects  new_num_subjects
AA        NA        NA        1             2
B         BB        C         3             4
NA        NA        A         1             1
NA        CC        D         2             3

To do this for a single subject, I would just do:

dt[, new_num_subjects := num_subjects]
dt[
  (subject1 == "AA" |
   subject1 == "AB" |
   subject1 == "BB" |
   subject1 == "BC" |
   subject1 == "CC"
  ), new_num_subjects := num_subjects + 1]

This works. But how do I avoid writing this out for all 30+ subjects?

I tried making a vector of all the subject names (subjectvec) and looping over it (see below), but this does not work:

for(i in subjectvec)
dt[
  ( (i) == "AA" |
    (i) == "AB" |
    (i) == "BB" |
    (i) == "BC" |
    (i) == "CC"
  ), new_num_subjects := num_subjects + 1]

3 answers:

Answer 0: (score: 0)

Here you go. You can define which columns are used in the calculation by indexing, for example with [, c(1:3, 5)].

subject1 <- c("AA","B",NA,NA)
subject2 <- c(NA,"BB",NA,"CC")
subject3 <- c(NA,"C","A","D")
num_subjects <- c(1,3,1,2)
grades<-data.frame(subject1,subject2,subject3,num_subjects, stringsAsFactors = FALSE)

grades$new_num_subjects <-
  sapply(1:nrow(grades), function(x)
    sum(nchar(grades[x,1:3]),na.rm = TRUE)
    )

View the output:

grades

  subject1 subject2 subject3 num_subjects new_num_subjects
1       AA     <NA>     <NA>            1                2
2        B       BB        C            3                4
3     <NA>     <NA>        A            1                1
4     <NA>       CC        D            2                3

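Following Roman Luštrik's example, try this and see whether it is faster. lapply takes a list, and I am not sure what data types the variables you are using have. We answered your question based on the example provided; if you provide more data, it will be easier for us to help.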

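# Use only the subject columns: grades[, -4] would also pick up
# new_num_subjects once that column exists
grades$new_num_subjects <- 
  apply(grades[, 1:3], MARGIN = 1, FUN = function(x) sum(sapply(x, nchar), na.rm = TRUE))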
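
Both versions above go through the data row by row. As a minimal sketch of the same character-counting idea without an explicit per-row function, the subject columns can be converted to a character matrix and summed with rowSums. The name subject_cols and the assumption that every subject column starts with "subject" are mine, not part of the question.

```r
# Example data from this answer
grades <- data.frame(
  subject1 = c("AA", "B", NA, NA),
  subject2 = c(NA, "BB", NA, "CC"),
  subject3 = c(NA, "C", "A", "D"),
  num_subjects = c(1, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Assumption: every subject column name starts with "subject"
subject_cols <- grep("^subject", names(grades), value = TRUE)

# nchar() keeps the matrix shape and returns NA for missing strings,
# so na.rm = TRUE simply skips subjects that were not taken
grade_lengths <- nchar(as.matrix(grades[subject_cols]))
grades$new_num_subjects <- rowSums(grade_lengths, na.rm = TRUE)
```

This relies on every grade being one letter per unit of intensity, as in the example data.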

Answer 1: (score: 0)

I ran this on a very large simulated data frame (1M rows) and it runs very quickly (although it is not a data.table approach):

# Dummy data
subject <- c("A", "AA", "B", "BB", "C", "CC", NA)
n <- 1000000
grades <- data.frame("s1" = sample(subject, n, replace = T), "s2" = sample(subject, n, replace = T), "s3" = sample(subject, n, replace = T))

# Count subjects
grades$new <- nchar(gsub("NA", "", paste0(grades[,1], grades[,2], grades[,3])))
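
The line above names the three columns by hand. With 30+ subjects, one possible sketch is to select the columns by name and pass them to paste0 through do.call. The seed, the small n and the pattern "^s[0-9]+$" are assumptions made for illustration only.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible
subject <- c("A", "AA", "B", "BB", "C", "CC", NA)
n <- 10
grades <- data.frame(
  s1 = sample(subject, n, replace = TRUE),
  s2 = sample(subject, n, replace = TRUE),
  s3 = sample(subject, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Assumption: subject columns are named s1, s2, ...
subject_cols <- grep("^s[0-9]+$", names(grades), value = TRUE)

# paste0() turns missing values into the string "NA", which gsub() then removes
pasted <- do.call(paste0, unname(as.list(grades[subject_cols])))
grades$new <- nchar(gsub("NA", "", pasted))
```

Note that removing the literal text "NA" is only safe while no real grade contains an "N"; otherwise an "N" followed by an "A" in the pasted string would be deleted as well.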

Answer 2: (score: 0)

Using Lee S's data, here are 2 data.table-based options:

library(data.table)

subject1 <- c("AA","B",NA,NA)
subject2 <- c(NA,"BB",NA,"CC")
subject3 <- c(NA,"C","A","D")
num_subjects <- c(1,3,1,2)

grades<-data.table(subject1,subject2,subject3,num_subjects)

## first option

# 1 paste subject1:subject3 together (but NA becomes character "NA" in paste'ing) with do.call
# 2 change all of those "NA"s to "" with gsub
# 3 find nchar

grades[,num_new := nchar(gsub("NA","",do.call(paste0,.SD))), .SDcols=subject1:subject3]

grades

#   subject1 subject2 subject3 num_subjects num_new
#1:       AA       NA       NA            1       2
#2:        B       BB        C            3       4
#3:       NA       NA        A            1       1
#4:       NA       CC        D            2       3


## Second option
rowSums_na.rm <- function(x) rowSums(x, na.rm = TRUE)
grades[,num_new := rowSums_na.rm(as.data.table(lapply(.SD, nchar))),  
           .SDcols=subject1:subject3]
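
Both options count characters. If the goal is instead to keep the loop from the question and add 1 only for an explicit set of double grades, a minimal sketch could look like the following. The vector double_grades and the "^subject" naming pattern are assumptions; subjectvec is the name used in the question.

```r
library(data.table)

dt <- data.table(
  subject1 = c("AA", "B", NA, NA),
  subject2 = c(NA, "BB", NA, "CC"),
  subject3 = c(NA, "C", "A", "D"),
  num_subjects = c(1, 3, 1, 2)
)

# Assumptions: these are the only double grades, and every subject
# column name starts with "subject"
double_grades <- c("AA", "AB", "BB", "BC", "CC")
subjectvec <- grep("^subject", names(dt), value = TRUE)

dt[, new_num_subjects := num_subjects]
for (col in subjectvec) {
  # dt[[col]] looks the column up by its name; %in% treats NA as "no match"
  is_double <- dt[[col]] %in% double_grades
  # Add to new_num_subjects (not num_subjects) so several double
  # subjects in one row accumulate
  dt[is_double, new_num_subjects := new_num_subjects + 1]
}
```

The loop in the question compared the column name itself (for example "subject1") with "AA", which is never true, so no row was updated. Looking the column up by name avoids that.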