Assign values to a new column in a data.table based on strings in an existing column

Time: 2014-11-29 08:59:44

Tags: r data.table

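I have a table of variables describing university students' courses. It is already in data.table format.

One of the columns, SCA_TITLE, contains the course name. It holds strings such as "Bachelor of Information Systems" and "Bachelor of Laws and Bachelor of Information Systems".

I want to create a new column named "DOUBLE DEGREE". It should be 1 when the student is studying a double degree and 0 when they are not.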

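Basically, if SCA_TITLE meets either of the following conditions:
  • it contains the string "and Bachelor", or
  • the word "Bachelor" appears twice

then the value in the new column needs to be set to 1; otherwise it needs to be set to 0.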
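The two conditions, read literally as checks on each title string, can be sketched as below. This is only an illustration under assumptions: the helper name `is_double_degree` is made up, the third sample title is invented to exercise the "appears twice" rule, and "twice" is interpreted as "two or more times".

```r
# Hypothetical helper (not from the original post): returns 1L when a title
# contains "and Bachelor" or mentions "Bachelor" at least twice, else 0L.
is_double_degree <- function(title) {
  # gregexpr() gives the match positions per string; -1 means no match
  matches <- gregexpr("Bachelor", title, fixed = TRUE)
  n_bachelor <- vapply(matches, function(m) sum(m > 0), integer(1))
  has_and <- grepl("and Bachelor", title, fixed = TRUE)
  as.integer(has_and | n_bachelor >= 2)
}

titles <- c("Bachelor of Information Systems",
            "Bachelor of Laws and Bachelor of Information Systems",
            "Bachelor of Arts/Bachelor of Science")  # invented sample title
flags <- is_double_degree(titles)
```

On a data.table the helper could be applied by reference with something like `dt[, DOUBLE_DEGREE := is_double_degree(SCA_TITLE)]`, where `dt` stands in for the real table.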

Any assistance is greatly appreciated.

The SCA_TITLE column looks like this. There are 465K observations and 65 variables:

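204: Honours degree of Bachelor of Science (Environmental Management)
205: Honours degree of Bachelor of Science (Medical Bioscience)
206: Honours degree of Bachelor of Science (Science Scholars Program)
207: Honours degree of Bachelor of Visual Arts
208: Honours degree of Bachelor of Visual Communication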

1 Answer:

Answer 0: (score: 1)

You could try

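library(data.table)
# Set the flag to 1 when the title contains 'and Bachelor' or when the
# same ID appears on more than one row (.N > 1 within the ID group)
setDT(df)[, DOUBLE_DEGREE := as.numeric(grepl('and Bachelor',
                                        SCA_TITLE) | .N > 1), by = ID]
df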
 #    ID                       SCA_TITLE DOUBLE_DEGREE
 # 1:  3                   Bachelor of A             0
 # 2:  2                   Bachelor of B             1
 # 3:  5                   Bachelor of C             1
 # 4:  4                   Bachelor of D             0
 # 5:  5                   Bachelor of E             1
 # 6:  7                   Bachelor of F             0
 # 7:  2 Bachelor of G and Bachelor of N             1
 # 8:  6                   Bachelor of H             1
 # 9:  6                   Bachelor of I             1
 #10:  2                   Bachelor of J             1
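
The `.N > 1` term flags every ID that occurs on more than one row, whatever the title says. As a rough sketch of what that term sees, the per-ID row counts can be computed from just the ID values of the example data; the names `toy` and `n_rows` are made up for this illustration.

```r
library(data.table)

# ID values copied from the example data in this answer
toy <- data.table(ID = c(3L, 2L, 5L, 4L, 5L, 7L, 2L, 6L, 6L, 2L))

# .N is the number of rows in the current by-group
counts <- toy[, .(n_rows = .N), by = ID]
```

IDs 2, 5 and 6 each have more than one row here, which is why all of their rows receive a 1 in the output above.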

Update

If you have other degrees as well, and only rows matching 'Bachelor' or 'and Bachelor' should be considered:

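  # Flag an ID when more than one of its rows mentions 'Bachelor',
  # or flag any row whose title contains 'and Bachelor'
  setDT(df1)[, DOUBLE_DEGREE := as.numeric(sum(grepl('Bachelor',
             SCA_TITLE)) > 1 | grepl('and Bachelor', SCA_TITLE)), by = ID]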

 df1
 #    ID                                         SCA_TITLE DOUBLE_DEGREE
 #1:  3                   Honours degree of Bachelor of A             0
 #2:  2                   Honours degree of Bachelor of B             1
 #3:  5                   Honours degree of Bachelor of C             1
 #4:  4                   Honours degree of Bachelor of D             0
 #5:  5                  Honours  degree of Bachelor of E             1
 #6:  7                   Honours degree of Bachelor of F             0
 #7:  9 Honours degree of Bachelor of G and Bachelor of N             1
 #8:  6                                       Some degree             0
 #9:  6                   Honours degree of Bachelor of I             0
 #10: 2                   Honours degree of Bachelor of J             1

Data

df <- structure(list(ID = c(3L, 2L, 5L, 4L, 5L, 7L, 2L, 6L, 6L, 2L), 
SCA_TITLE = c("Bachelor of A", "Bachelor of B", "Bachelor of C", 
"Bachelor of D", "Bachelor of E", "Bachelor of F", 
"Bachelor of G and Bachelor of N",     "Bachelor of H", "Bachelor of I",
"Bachelor of J")), .Names = c("ID", "SCA_TITLE"), row.names = c(NA, -10L),
class = "data.frame")

df1 <-  structure(list(ID = c(3, 2, 5, 4, 5, 7, 9, 6, 6, 2), SCA_TITLE =
c("Honours degree of Bachelor of A", "Honours degree of Bachelor of B",
"Honours degree of Bachelor of C", "Honours degree of Bachelor of D", 
"Honours  degree of Bachelor of E", "Honours degree of Bachelor of F", 
"Honours degree of Bachelor of G and Bachelor of N", "Some degree", 
"Honours degree of Bachelor of I", "Honours degree of Bachelor of J"
 )), .Names = c("ID", "SCA_TITLE"), row.names = c(NA, -10L),
 class = "data.frame")