Get the smallest number in a string

Asked: 2019-01-23 09:11:34

Tags: r string

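I have a data frame whose columns capture different events. Respondents fill in the age at which they experienced each event. For any event they experienced more than once, they separate the ages with semicolons (e.g., if they experienced it at ages 5, 6 and 7, they enter 5; 6; 7 in that column). For events they did not experience, respondents leave the field blank.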

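Because there are more than 20 such columns, I concatenated them all into a single character column. I would like to extract the smallest number in that string. I can't coerce the columns to a numeric type, because events experienced more than once are read by R as strings (e.g. "5; 6; 7").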

My data looks like this:

df <- data.frame(ID = c("001", "002", "003", "004"),
             concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                              "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                              "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                              "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"))

df$concatenated <- as.character(df$concatenated)

The end result I would like to get is:

ID                           concatenated smallest_number
1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4

Thanks! Much appreciated!

4 answers:

Answer 0 (score: 1)

Suppose your data is structured like this:

DF <- data.frame(ID = 1:4,
                 age = c("5", "5;6;7", "20;15;12", "2;4"),
                 stringsAsFactors = FALSE)

You can use strsplit to split each age entry into its individual numbers, then take the minimum in the usual way:

DF$min_age <- vapply(strsplit(DF$age, split = "[^0-9]"),
                     function(x) min(as.numeric(x), na.rm = TRUE),
                     double(1))

If some rows contain no numbers at all, leave those rows out:

i <- grep("[0-9]", DF$age)  # rows with numbers somewhere
DF$min_age <- NA_real_      # numeric NA, so the column is not coerced to character
DF$min_age[i] <- vapply(strsplit(DF$age[i], split = "[^0-9]"),
                        function(x) min(as.numeric(x), na.rm = TRUE),
                        double(1))
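The same idea can be applied directly to the `concatenated` column from the question. The following is only a sketch under a few assumptions: the column is character rather than factor, the split pattern is widened to `"[^0-9]+"` so that `_`, `; ` and the letters of `NA` all act as one separator, and `min_or_na` is a hypothetical helper name, not something from the answer.

```r
# Sample data from the question, kept as character rather than factor
df <- data.frame(ID = c("001", "002", "003", "004"),
                 concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                                  "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                                  "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                                  "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"),
                 stringsAsFactors = FALSE)

# Split on every run of non-digit characters
pieces <- strsplit(df$concatenated, split = "[^0-9]+")

# Hypothetical helper: smallest number in one split row, NA if the row has none
min_or_na <- function(x) {
  x <- as.numeric(x[nzchar(x)])  # drop the empty string left by a leading separator
  if (length(x) == 0) NA_real_ else min(x)
}

df$smallest_number <- vapply(pieces, min_or_na, double(1))
df$smallest_number
```

Returning `NA_real_` for rows without digits avoids the separate `grep` step, at the cost of a slightly longer helper.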

Answer 1 (score: 1)

We can use gsub to modify the elements so that each one becomes a single underscore-separated string, then apply scan and min to them:

df$smallest_number <- sapply(df$concatenated, function(x){
  min(scan(text=gsub("; ","_",x), what = numeric(), sep="_"),na.rm=TRUE)})
df
#    ID                           concatenated smallest_number
# 1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
# 2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
# 3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
# 4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4
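To see why this works, it may help to walk through a single element step by step. This is a sketch on the first row of the sample data; `flat` and `vals` are illustrative names, and `quiet = TRUE` is added only to suppress the "Read N items" message that scan otherwise prints.

```r
x <- "NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA"

# Step 1: turn "; " into "_" so that one separator is used throughout
flat <- gsub("; ", "_", x)
flat

# Step 2: scan() parses the fields as numbers; the literal "NA" fields
# match the default na.strings and are read as missing values
vals <- scan(text = flat, what = numeric(), sep = "_", quiet = TRUE)
vals

# Step 3: ignore the missing values when taking the minimum
min(vals, na.rm = TRUE)
```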

Answer 2 (score: 1)

With tidyverse and splitstackshape, you can do the following:

library(tidyverse)
library(splitstackshape)  # provides cSplit()

df %>%
 mutate(temp = gsub(";", "_", concatenated),
        temp = gsub(" ", "", temp)) %>%
 cSplit("temp", sep = "_") %>%
 gather(var, val, -c(concatenated, ID)) %>%
 group_by(ID) %>%
 mutate(res = min(val, na.rm = TRUE)) %>%
 spread(var, val) %>%
 select(ID, concatenated, res)

  ID    concatenated                             res
  <fct> <chr>                                  <dbl>
1 001   NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA            4.
2 002   3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA       3.
3 003   NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA    2.
4 004   NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9         4.

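First, it replaces ; with _ and splits the "concatenated" column on _. Second, it converts the data from wide to long format and groups it by the "ID" column. Finally, it computes the minimum and returns the data to wide format.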
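The split, reshape-to-long and group-wise minimum steps can also be sketched in base R, which makes the intermediate long table easy to inspect. This is not the answer's code; it assumes the column is character, and `temp`, `parts`, `long` and `res` are illustrative names.

```r
df <- data.frame(ID = c("001", "002", "003", "004"),
                 concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                                  "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                                  "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                                  "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"),
                 stringsAsFactors = FALSE)

# Normalise the separators: ";" becomes "_" and spaces are dropped
temp <- gsub(" ", "", gsub(";", "_", df$concatenated))

# Split every row on "_" and stack the pieces into a long table,
# one row per value, with the ID repeated alongside
parts <- strsplit(temp, "_")
long <- data.frame(ID  = rep(df$ID, lengths(parts)),
                   val = as.numeric(unlist(parts)),  # the string "NA" becomes NA
                   stringsAsFactors = FALSE)

# Group by ID and take the minimum, ignoring the NA entries
res <- tapply(long$val, long$ID, min, na.rm = TRUE)

# Attach the per-ID minimum back to the original wide data
df$res <- as.numeric(res[df$ID])
df
```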

Or using only tidyverse:

library(tidyverse)

df %>% 
 mutate(temp = gsub(";", "_", concatenated),
        temp = gsub(" ", "", temp),
        temp = strsplit(temp, "_")) %>%
 unnest(temp) %>%
 group_by(ID) %>%
 # convert only temp; mutate_if(is.character, ...) would also hit a character "concatenated"
 mutate(temp = as.numeric(temp)) %>%
 mutate(res = min(temp, na.rm = TRUE),
        rowid = row_number()) %>%
 spread(rowid, temp) %>%
 select(ID, concatenated , res)

  ID    concatenated                             res
  <fct> <fct>                                  <dbl>
1 001   NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA            4.
2 002   3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA       3.
3 003   NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA    2.
4 004   NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9         4.

Answer 3 (score: 1)

library(stringr)
df$smallest_number <- sapply(
  str_extract_all(df$concatenated, "[0-9]+"),
  function(x) min(as.integer(x))
)
df
   ID                           concatenated smallest_number
1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4
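One caveat with extracting digits this way: for a row that contains no digits, `min()` receives an empty vector, warns and returns `Inf`. A possible guard is sketched below using base R's `regmatches` and `gregexpr` in place of stringr, which is a swap made here only to keep the example dependency-free. The second string is a made-up respondent who reported no events.

```r
concatenated <- c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                  "NA_NA_NA")  # hypothetical row with no ages at all

# Pull every run of digits out of each string
digits <- regmatches(concatenated, gregexpr("[0-9]+", concatenated))

# Return NA for rows without digits instead of letting min() warn and give Inf
smallest <- vapply(digits,
                   function(x) if (length(x) == 0) NA_integer_ else min(as.integer(x)),
                   integer(1))
smallest
```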