How do I extract the integers inside curly braces in R?

Time: 2018-05-02 06:00:25

Tags: r regex perl stringr

I have a sample data frame, act, with two variables that looks like this:

   activity_id          activity_ids
1          227 {227,32,33,34,35,252}
2           32 {227,32,33,34,35,252}
3           33 {227,32,33,34,35,252}
4           34 {227,32,33,34,35,252}
5           35 {227,32,33,34,35,252}
6          252 {227,32,33,34,35,252}
7          227 {227,32,33,34,35,252}
8           32 {227,32,33,34,35,252}
9           33 {227,32,33,34,35,252}
10          34 {227,32,33,34,35,252}

activity_id is an integer variable and activity_ids is a character variable.

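Now I want to add a new boolean variable, say last_activity, that returns TRUE or FALSE depending on whether the value of activity_id is the last number in the set of numbers inside the curly braces of the activity_ids variable. For this sample data, the new variable last_activity should return TRUE only for row 6 (because 252 is the last number) and FALSE for all other rows. Also, in this sample data the activity_ids variable has 6 numbers inside the braces, but it can contain any number of values. So I need code that generalizes to any number of values.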

Thanks!

5 answers:

Answer 0: (score: 3)

As a base R option, sub can work here:

df <- data.frame(activity_id=c(227, 252),
                 activity_ids=c("{227,32,33,34,35,252}", "{227,32,33,34,35,252}"))

df$last_activity <- df$activity_id == sub(".*,(\\d+)\\}$", "\\1", df$activity_ids)
df

      activity_id          activity_ids last_activity
1             227 {227,32,33,34,35,252}         FALSE
2             252 {227,32,33,34,35,252}          TRUE
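Note that the pattern above requires a comma before the last number, so a single-value set such as "{252}" would not be reduced to its number. The following is only a sketch of one way to cover that case; the helper name last_in_braces and the extra test rows are hypothetical, not part of the answer.

```r
# Hypothetical helper: returns the last number inside the braces as a
# character string, or NA when the string does not end in "<digits>}"
# preceded by "{" or ",".
last_in_braces <- function(x) {
  x <- as.character(x)
  pattern <- "^.*[{,](\\d+)\\}$"
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

# Made-up rows: the original example, a single-value set, and a junk string
df <- data.frame(
  activity_id  = c(227L, 252L, 252L, 212L),
  activity_ids = c("{227,32,33,34,35,252}", "{227,32,33,34,35,252}",
                   "{252}", "somejunk"),
  stringsAsFactors = FALSE
)

last <- last_in_braces(df$activity_ids)
# Compare as strings; rows without a parsable set become FALSE, not NA
df$last_activity <- !is.na(last) & last == as.character(df$activity_id)
df
```

Comparing as strings avoids the warning that as.integer() would raise on unparsable rows.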


Answer 1: (score: 2)

Edit

I just realized that the original approach has a problem when the last value in activity_ids contains extra digits. For example,

df$activity_ids[6] <- "{227,32,33,34,35,2521}"

mapply(function(x, y) grepl(y, tail(x, 1), fixed = TRUE),
       strsplit(df$activity_ids, ","), df$activity_id)

#[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

would still return TRUE for row 6, which is wrong.

To overcome this, we can instead extract the numeric part of the last value and compare it with activity_id for equality:

# gsub() strips every non-digit, so a single-value set such as "{252}" also works
mapply(function(x, y) y == gsub("[^0-9]", "", tail(x, 1)),
       strsplit(df$activity_ids, ","), df$activity_id)

#[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

df$activity_ids[6] <- "{227,32,33,34,35,2521}"

mapply(function(x, y) y == gsub("[^0-9]", "", tail(x, 1)),
       strsplit(df$activity_ids, ","), df$activity_id)

#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Original answer

A non-regex option is to split the string on "," and use tail to get the last value, then use grepl to check whether activity_id occurs in it:

df$last_activity <- mapply(function(x, y) grepl(y, tail(x, 1), fixed = TRUE),
                    strsplit(df$activity_ids, ","), df$activity_id)

#   activity_id          activity_ids last_activity
#1          227 {227,32,33,34,35,252}         FALSE
#2           32 {227,32,33,34,35,252}         FALSE
#3           33 {227,32,33,34,35,252}         FALSE
#4           34 {227,32,33,34,35,252}         FALSE
#5           35 {227,32,33,34,35,252}         FALSE
#6          252 {227,32,33,34,35,252}          TRUE
#7          227 {227,32,33,34,35,252}         FALSE
#8           32 {227,32,33,34,35,252}         FALSE
#9           33 {227,32,33,34,35,252}         FALSE
#10          34 {227,32,33,34,35,252}         FALSE

Answer 2: (score: 1)

A regex approach is to use stri_extract_last_regex from the stringi package to extract the last number from the string and compare it with activity_id:

library(stringi)
df$activity_id == stri_extract_last_regex(df$activity_ids, "[0-9]+")

#[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
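If adding a package is not an option, the same "take the last run of digits" idea can be sketched in base R with gregexpr() and regmatches(). The function name last_number and the sample vector ids are made up for illustration.

```r
# Hypothetical base R helper: last run of digits in each string, or NA
last_number <- function(x) {
  x <- as.character(x)
  # One character vector of all digit runs per input string
  matches <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(matches,
         function(m) if (length(m) > 0) m[length(m)] else NA_character_,
         character(1))
}

ids <- c("{227,32,33,34,35,252}", "{7}", "somejunk")
last_number(ids)
#[1] "252" "7"   NA
```

Because the result can contain NA, a comparison such as activity_id == last_number(activity_ids) yields NA for those rows; wrapping the comparison in `%in% TRUE` is one way to turn them into FALSE.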

Answer 3: (score: 0)

Base R:

transform(dat,s=Vectorize(grepl)(paste0(activity_id,"}"),activity_ids))
   activity_id          activity_ids     s
1          227 {227,32,33,34,35,252} FALSE
2           32 {227,32,33,34,35,252} FALSE
3           33 {227,32,33,34,35,252} FALSE
4           34 {227,32,33,34,35,252} FALSE
5           35 {227,32,33,34,35,252} FALSE
6          252 {227,32,33,34,35,252}  TRUE
7          227 {227,32,33,34,35,252} FALSE
8           32 {227,32,33,34,35,252} FALSE
9           33 {227,32,33,34,35,252} FALSE
10          34 {227,32,33,34,35,252} FALSE

To speed up the computation, use the stringi package:

stringi::stri_detect_fixed(dat$activity_ids,paste0(dat$activity_id,"}"))
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
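One caveat with matching on paste0(activity_id, "}"): an id that is only a suffix of the last number also matches, for example 52 against "...,252}". A possible way around this, sketched below under the assumption that the sets contain no whitespace, is to include the preceding delimiter in the comparison. The function name is_last_id is hypothetical.

```r
# The suffix problem: "52}" is found inside "{227,252}"
grepl("52}", "{227,252}", fixed = TRUE)
#[1] TRUE

# Hypothetical helper: the id must be preceded by "," or be the only value
is_last_id <- function(id, ids) {
  id <- as.character(id)
  ids <- as.character(ids)
  endsWith(ids, paste0(",", id, "}")) | ids == paste0("{", id, "}")
}

is_last_id(c(252L, 52L, 7L, 35L),
           c("{227,252}", "{227,252}", "{7}", "{35,252}"))
#[1]  TRUE FALSE  TRUE FALSE
```

endsWith() is available in base R from version 3.3.0 and does plain string comparison, so no regex escaping is needed.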

Answer 4: (score: 0)

Another approach in base R, applying a function over the rows with apply():

cols <- c('activity_id', 'activity_ids')
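df$last_activity <- apply(df[cols], 1, function(row) {
  # apply() converts each row to character and may pad numbers with spaces
  # (e.g. " 32"), so trim activity_id before comparing
  x <- unlist(strsplit(row['activity_ids'], "[{},]"))
  trimws(row['activity_id']) == x[length(x)]
})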

Or using mapply():

df$last_activity <- mapply(function(x,y) {x == y[length(y)]}, 
       x = df$activity_id, 
       y = strsplit(df$activity_ids, "[{},]")
)

Both yield:

   activity_id          activity_ids last_activity
1          227 {227,32,33,34,35,252}         FALSE
2           32 {227,32,33,34,35,252}         FALSE
3           33 {227,32,33,34,35,252}         FALSE
4           34 {227,32,33,34,35,252}         FALSE
5           35 {227,32,33,34,35,252}         FALSE
6          252 {227,32,33,34,35,252}          TRUE
7          227 {227,32,33,34,35,252}         FALSE
8           32 {227,32,33,34,35,252}         FALSE
9           33 {227,32,33,34,35,252}         FALSE
10          34 {227,32,33,34,35,252}         FALSE
11         212              somejunk         FALSE