I have a sample data frame, act, with two variables that look like this:
activity_id activity_ids
1 227 {227,32,33,34,35,252}
2 32 {227,32,33,34,35,252}
3 33 {227,32,33,34,35,252}
4 34 {227,32,33,34,35,252}
5 35 {227,32,33,34,35,252}
6 252 {227,32,33,34,35,252}
7 227 {227,32,33,34,35,252}
8 32 {227,32,33,34,35,252}
9 33 {227,32,33,34,35,252}
10 34 {227,32,33,34,35,252}
activity_id is an integer variable and activity_ids is a character variable.
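For anyone who wants to reproduce the problem, the sample above can be rebuilt roughly as follows. This is only a sketch based on the printed rows; the column types follow the description (integer and character), and stringsAsFactors = FALSE is set explicitly as an assumption so that activity_ids stays character on older R versions.

```r
# Rebuild the ten sample rows shown in the question.
act <- data.frame(
  activity_id  = c(227L, 32L, 33L, 34L, 35L, 252L, 227L, 32L, 33L, 34L),
  activity_ids = rep("{227,32,33,34,35,252}", 10),
  stringsAsFactors = FALSE
)

# Inspect the column types: integer and character.
str(act)
```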
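Now I want to add a new boolean variable, say last_activity, which returns TRUE or FALSE by checking whether the value of activity_id is the last number in the set of numbers inside the curly braces of the activity_ids variable. For this sample data, the new variable last_activity should return TRUE only for row 6 (because 252 is the last number) and FALSE for all other rows.
In this sample data the activity_ids variable has 6 numbers inside the braces, but it can contain any number of values, so I need code that generalizes to any number of values.
Thanks!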
Answer 0 (score: 3)
Using base R, sub can work here:
df <- data.frame(activity_id=c(227, 252),
activity_ids=c("{227,32,33,34,35,252}", "{227,32,33,34,35,252}"))
df$last_activity <- df$activity_id == sub(".*,(\\d+)\\}$", "\\1", df$activity_ids)
df
activity_id activity_ids last_activity
1 227 {227,32,33,34,35,252} FALSE
2 252 {227,32,33,34,35,252} TRUE
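One thing to keep in mind with this comparison: the left side is numeric and the right side is a string, so R converts the number to text before comparing. For an id stored as a double, that text can be in scientific notation (for example, as.character(100000) gives "1e+05"), and the match then fails. A minimal sketch of a variant that converts the extracted text to integer instead is shown below; the third row is a made-up value added only to illustrate the point.

```r
# Sample data as in the answer, plus one large id stored as a double (illustrative only).
df <- data.frame(activity_id  = c(227, 252, 100000),
                 activity_ids = c("{227,32,33,34,35,252}",
                                  "{227,32,33,34,35,252}",
                                  "{5,100000}"),
                 stringsAsFactors = FALSE)

# Extract the last number with the same pattern, then convert it to integer
# so that the comparison is numeric rather than string-based.
last_num <- as.integer(sub(".*,(\\d+)\\}$", "\\1", df$activity_ids))
df$last_activity <- df$activity_id == last_num
df$last_activity
# FALSE TRUE TRUE
```

If activity_id is a true integer column, as in the question, the string comparison behaves the same way; the conversion only matters for doubles.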
Answer 1 (score: 2)
Edit
I just realized that the original approach has a problem when the last value in activity_ids contains extra digits. For example,
df$activity_ids[6] <- "{227,32,33,34,35,2521}"
mapply(function(x, y) grepl(y, tail(x, 1), fixed = TRUE),
strsplit(df$activity_ids, ","), df$activity_id)
#[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
still returns TRUE for row 6, which is wrong.
To overcome this, we can instead extract the numeric part of the last value and compare it with activity_id:
# on the original data, where row 6 is "{227,32,33,34,35,252}"
mapply(function(x, y) y == sub("[^0-9]","",tail(x, 1)),
strsplit(df$activity_ids, ","), df$activity_id)
#[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# after changing row 6 so that its last value is 2521
df$activity_ids[6] <- "{227,32,33,34,35,2521}"
mapply(function(x, y) y == sub("[^0-9]","",tail(x, 1)),
strsplit(df$activity_ids, ","), df$activity_id)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
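A caveat with the sub("[^0-9]", ...) call: sub replaces only the first match, which is enough when the last piece looks like "252}". If a set has a single value, such as "{252}", the piece carries both braces and sub leaves "252}" behind. A possible variation, assuming such one-element sets can occur, is to use gsub so that every non-digit is removed. The input vectors below are hypothetical.

```r
# Hypothetical input: the second set has only one element.
ids <- c("{227,32,252}", "{252}")
id  <- c(252, 252)

# gsub strips all non-digit characters from the last piece, including both braces.
res <- mapply(function(x, y) y == gsub("[^0-9]", "", tail(x, 1)),
              strsplit(ids, ","), id)
res
# TRUE TRUE
```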
Original answer
A non-regex option is to split the string on ",", use tail to get the last value, and check with grepl whether activity_id is present in it.
df$last_activity <- mapply(function(x, y) grepl(y, tail(x, 1), fixed = TRUE),
strsplit(df$activity_ids, ","), df$activity_id)
# activity_id activity_ids last_activity
#1 227 {227,32,33,34,35,252} FALSE
#2 32 {227,32,33,34,35,252} FALSE
#3 33 {227,32,33,34,35,252} FALSE
#4 34 {227,32,33,34,35,252} FALSE
#5 35 {227,32,33,34,35,252} FALSE
#6 252 {227,32,33,34,35,252} TRUE
#7 227 {227,32,33,34,35,252} FALSE
#8 32 {227,32,33,34,35,252} FALSE
#9 33 {227,32,33,34,35,252} FALSE
#10 34 {227,32,33,34,35,252} FALSE
Answer 2 (score: 1)
A regex approach is to use stri_extract_last_regex from the stringi package to extract the last number from the string and compare it with activity_id:
library(stringi)
df$activity_id == stri_extract_last_regex(df$activity_ids, "[0-9]+")
#[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Answer 3 (score: 0)
Base R:
transform(dat,s=Vectorize(grepl)(paste0(activity_id,"}"),activity_ids))
activity_id activity_ids s
1 227 {227,32,33,34,35,252} FALSE
2 32 {227,32,33,34,35,252} FALSE
3 33 {227,32,33,34,35,252} FALSE
4 34 {227,32,33,34,35,252} FALSE
5 35 {227,32,33,34,35,252} FALSE
6 252 {227,32,33,34,35,252} TRUE
7 227 {227,32,33,34,35,252} FALSE
8 32 {227,32,33,34,35,252} FALSE
9 33 {227,32,33,34,35,252} FALSE
10 34 {227,32,33,34,35,252} FALSE
To speed up the computation, use the stringi package:
stringi::stri_detect_fixed(dat$activity_ids,paste0(dat$activity_id,"}"))
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
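Both versions test whether the string contains the id followed by "}", so an id that is only the tail of the last number can also match: 52 would be reported for a set ending in 252. A sketch of a stricter check in base R, assuming the "{a,b,c}" format shown in the question, requires a "," or the opening "{" directly before the id. The input vectors are made up for illustration, and endsWith needs R 3.3.0 or later.

```r
ids <- c("{227,32,33,34,35,252}", "{227,32,33,34,35,252}", "{52}")
id  <- c(52L, 252L, 52L)

# Suffix-only test: 52 is wrongly reported in row 1 because "252}" ends in "52}".
suffix_only <- endsWith(ids, paste0(id, "}"))

# Stricter test: the id must follow a comma, or be the only element of the set.
delimited <- endsWith(ids, paste0(",", id, "}")) | ids == paste0("{", id, "}")

suffix_only
# TRUE TRUE TRUE
delimited
# FALSE TRUE TRUE
```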
Answer 4 (score: 0)
Another approach in base R, using apply() over the rows:
cols <- c('activity_id', 'activity_ids')
df$last_activity <- apply(df[cols], 1, function(col) {
  # apply() turns each row into character and may pad numbers with spaces, so trim the id
  x <- unlist(strsplit(col['activity_ids'], "[{},]"))
  return(trimws(col['activity_id']) == x[length(x)])
})
Or using mapply():
df$last_activity <- mapply(function(x,y) {x == y[length(y)]},
x = df$activity_id,
y = strsplit(df$activity_ids, "[{},]")
)
Both produce:
activity_id activity_ids last_activity
1 227 {227,32,33,34,35,252} FALSE
2 32 {227,32,33,34,35,252} FALSE
3 33 {227,32,33,34,35,252} FALSE
4 34 {227,32,33,34,35,252} FALSE
5 35 {227,32,33,34,35,252} FALSE
6 252 {227,32,33,34,35,252} TRUE
7 227 {227,32,33,34,35,252} FALSE
8 32 {227,32,33,34,35,252} FALSE
9 33 {227,32,33,34,35,252} FALSE
10 34 {227,32,33,34,35,252} FALSE
11 212 somejunk FALSE