Question

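I have a large student dataset in which honours students are recorded with non-standard naming conventions. I need to create and populate a new column that returns Y or N based on a string match for the word "Honours".

Currently my data looks like this, with more than 200,000 students: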

library(data.table)
students<-data.table(Student_ID = c(10001:10005), 
                    Degree= c("Bachelor of Laws", "Honours Degree in Commerce", "Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"))

I need to add a third column, 'Honours', created the data.table way, so that it ends up populated as follows:

students<-data.table(Student_ID = c(10001:10005), 
                      Degree= c("Bachelor of Laws", "Honours Degree in Commerce","Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"), 
                      Honours = c("N","Y", "Y", "Y","N"))

Any help is much appreciated.

Also, by "the data.table way" I mean something like:

students[,Honours:="N"]

Answer 1

It's actually quite simple:

students[, Honours := c("N", "Y")[grepl("Honours", Degree, fixed = TRUE) + 1L]]

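All you need to do is search for "Honours" with a function that implements regular expressions, such as grepl. Since "Honours" is not really a regular expression, you can pass fixed = TRUE for better performance. Then subset c("N", "Y") according to what you found: adding 1L to the TRUE/FALSE logical vector turns it into a vector of 1s and 2s, which is used to pick values from c("N", "Y").

Or, if that is too hard to read, you can use ifelse instead: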
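The indexing trick can be traced step by step on a plain vector. This is a minimal base R sketch; the variable names degree, hit, idx and flag are illustrative and do not appear in the original answer.

```r
# Three degree names taken from the question's sample data
degree <- c("Bachelor of Laws",
            "Honours Degree in Commerce",
            "Bachelor of Nursing with Honours")

# Step 1: literal (non-regex) substring match gives a logical vector
hit <- grepl("Honours", degree, fixed = TRUE)   # FALSE TRUE TRUE

# Step 2: adding 1L coerces FALSE -> 1 and TRUE -> 2
idx <- hit + 1L                                 # 1 2 2

# Step 3: use those positions to index into c("N", "Y")
flag <- c("N", "Y")[idx]                        # "N" "Y" "Y"
```

The order of c("N", "Y") matters: position 1 must hold the value for FALSE and position 2 the value for TRUE.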

students[, Honours := ifelse(grepl("Honours", Degree, fixed = TRUE), "Y", "N")]

Of course, if "Honours" can appear in different case variants, you can switch the grepl call to grepl("Honours", Degree, ignore.case = TRUE)
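A small base R sketch of the case-insensitive variant follows. The mixed-case degree names are made up for illustration and are not part of the question's data. Note that fixed = TRUE is dropped here, because R ignores ignore.case (with a warning) when fixed = TRUE is set.

```r
# Hypothetical degree names with inconsistent capitalisation
degree <- c("Bachelor of Laws (with HONOURS)",
            "honours degree in Commerce",
            "Bachelor of Nursing")

# Case-sensitive literal match misses both variants
strict <- grepl("Honours", degree, fixed = TRUE)        # FALSE FALSE FALSE

# Case-insensitive match catches them
loose <- grepl("Honours", degree, ignore.case = TRUE)   # TRUE TRUE FALSE

flag <- ifelse(loose, "Y", "N")                         # "Y" "Y" "N"
```

The same grepl call can be placed on the right-hand side of := inside students[...].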

P.S.

I would suggest sticking with a logical vector, because it is easy to work with afterwards.

For example:

students[, Honours := grepl("Honours", Degree, fixed = TRUE)]

Now, if you want to select only the students with "Honours", you can do

students[(Honours)]
#    Student_ID                           Degree Honours
# 1:      10002       Honours Degree in Commerce    TRUE
# 2:      10003  Bachelor of Laws (with Honours)    TRUE
# 3:      10004 Bachelor of Nursing with Honours    TRUE

Or those without "Honours"

students[!(Honours)]
#    Student_ID              Degree Honours
# 1:      10001    Bachelor of Laws   FALSE
# 2:      10005 Bachelor of Nursing   FALSE
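Beyond filtering, a logical column also aggregates directly, since TRUE counts as 1 and FALSE as 0. The sketch below rebuilds the question's sample table; the names n_honours, share and Honours_YN are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Sample data from the question
students <- data.table(
  Student_ID = 10001:10005,
  Degree = c("Bachelor of Laws", "Honours Degree in Commerce",
             "Bachelor of Laws (with Honours)",
             "Bachelor of Nursing with Honours", "Bachelor of Nursing")
)

# Logical flag, as suggested above
students[, Honours := grepl("Honours", Degree, fixed = TRUE)]

# Count and proportion of honours students
n_honours <- students[, sum(Honours)]    # 3
share     <- students[, mean(Honours)]   # 0.6

# A Y/N column can still be derived later if a report needs it
students[, Honours_YN := ifelse(Honours, "Y", "N")]
```

Keeping the logical column as the source of truth and deriving the Y/N label only for display avoids string comparisons such as Honours == "Y" in later code.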

Assigning values to a new column in data.table based on a logical string match
