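I have a data frame whose columns capture different events. Respondents fill in the age at which they experienced each event. For any event they experienced more than once, they separate the ages with semicolons (for example, if it happened at ages 5, 6 and 7, they enter 5; 6; 7 in that column). For events they never experienced, they leave the field blank.
Because there are more than 20 such columns, I concatenated them all into a single character column. I want to extract the smallest number in that string. I cannot coerce the columns to a numeric type, because events experienced more than once are read by R as strings (e.g. "5; 6; 7").
My data looks like this: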
df <- data.frame(ID = c("001", "002", "003", "004"),
                 concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                                  "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                                  "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                                  "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"))
df$concatenated <- as.character(df$concatenated)
The end result I would like is:
ID concatenated smallest_number
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
Thanks, much appreciated!
Answer 0 (score: 1)
Suppose your data is structured like this:
DF <- data.frame(ID = 1:4,
                 age = c("5", "5;6;7", "20;15;12", "2;4"),
                 stringsAsFactors = FALSE)
You can use strsplit to break each age string into its individual numbers, then take the minimum in the usual way:
DF$min_age <- vapply(strsplit(DF$age, split = "[^0-9]"),
                     function(x) min(as.numeric(x), na.rm = TRUE),
                     double(1))
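If some rows contain no numbers at all, restrict the calculation to the rows that do and leave the rest as NA:
i <- grep("[0-9]", DF$age)  # rows with numbers somewhere
DF$min_age <- NA_real_      # numeric NA, so the column stays numeric
DF$min_age[i] <- vapply(strsplit(DF$age[i], split = "[^0-9]"),
                        function(x) min(as.numeric(x), na.rm = TRUE),
                        double(1))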
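The same strsplit idea can be applied directly to the question's concatenated column. The sketch below is one way to do it, assuming the only non-digit content is the NA placeholders and the "_" and "; " separators; min_or_na is a hypothetical helper name, not part of the original answer.

```r
df <- data.frame(
  ID = c("001", "002", "003", "004"),
  concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                   "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                   "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                   "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: smallest number among the tokens, or NA if there is none.
min_or_na <- function(tokens) {
  nums <- as.numeric(tokens[nzchar(tokens)])  # drop empty tokens before converting
  if (length(nums) == 0) NA_real_ else min(nums)
}

# Split on runs of non-digit characters, so "NA", "_" and "; " all act as separators.
pieces <- strsplit(df$concatenated, split = "[^0-9]+")
df$smallest_number <- vapply(pieces, min_or_na, double(1))
```

Splitting on "[^0-9]+" rather than "[^0-9]" avoids most empty tokens, and the helper returns NA instead of Inf for a row with no numbers.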
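Answer 1 (score: 1)
We can use gsub to rewrite each element so that every item is separated by an underscore, then apply scan and min to the result.
df$smallest_number <- sapply(df$concatenated, function(x) {
  # quiet = TRUE suppresses scan's "Read N items" message
  min(scan(text = gsub("; ", "_", x), what = numeric(), sep = "_", quiet = TRUE),
      na.rm = TRUE)
}, USE.NAMES = FALSE)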
df
# ID concatenated smallest_number
# 1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
# 2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
# 3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
# 4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
Answer 2 (score: 1)
With tidyverse and splitstackshape, you can do the following:
library(tidyverse)
library(splitstackshape)

df %>%
  mutate(temp = gsub(";", "_", concatenated),
         temp = gsub(" ", "", temp)) %>%
  cSplit("temp", sep = "_") %>%
  gather(var, val, -c(concatenated, ID)) %>%
  group_by(ID) %>%
  mutate(res = min(val, na.rm = TRUE)) %>%
  spread(var, val) %>%
  select(ID, concatenated, res)
ID concatenated res
<fct> <chr> <dbl>
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4.
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3.
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2.
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4.
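First, it replaces ; with _ and splits the "concatenated" column on _. Second, it reshapes the data from wide to long format and groups by the "ID" column. Finally, it computes the minimum and reshapes the data back to wide format.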
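The three steps described above (split, reshape to long, take the minimum per ID) can also be sketched in base R without any packages. This is an illustrative alternative rather than the answer's own method; the names tokens, long, res and result are assumptions introduced for the example.

```r
df <- data.frame(
  ID = c("001", "002", "003", "004"),
  concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                   "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                   "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                   "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"),
  stringsAsFactors = FALSE
)

# Step 1: turn "; " into "_" and split every string on "_".
tokens <- strsplit(gsub("; *", "_", df$concatenated), "_", fixed = TRUE)

# Step 2: long format, one row per (ID, value); "NA" tokens become NA.
long <- data.frame(
  ID  = rep(df$ID, lengths(tokens)),
  val = as.numeric(unlist(tokens)),
  stringsAsFactors = FALSE
)

# Step 3: minimum per ID (the formula interface drops NA rows by default),
# then join back so each original row keeps its concatenated string.
res <- aggregate(val ~ ID, data = long, FUN = min)
names(res)[2] <- "smallest_number"
result <- merge(df, res, by = "ID", all.x = TRUE)
```

Using all.x = TRUE in merge keeps any ID whose values were all NA, with smallest_number left as NA for that row.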
Or using tidyverse only:
library(tidyverse)

df %>%
  mutate(temp = gsub(";", "_", concatenated),
         temp = gsub(" ", "", temp),
         temp = strsplit(temp, "_")) %>%
  unnest(temp) %>%
  group_by(ID) %>%
  # convert only temp; mutate_if(is.character, ...) would also hit a character "concatenated"
  mutate(temp = as.numeric(temp),
         res = min(temp, na.rm = TRUE),
         rowid = row_number()) %>%
  spread(rowid, temp) %>%
  select(ID, concatenated, res)
ID concatenated res
<fct> <fct> <dbl>
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4.
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3.
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2.
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4.
Answer 3 (score: 1)
library(stringr)
df$smallest_number <- sapply(
str_extract_all(df$concatenated, "[0-9]+"),
function(x) min(as.integer(x))
)
df
ID concatenated smallest_number
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
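If stringr is not available, a similar extraction can be sketched in base R with gregexpr and regmatches. This variant is an assumption-labelled alternative, not part of the answer above, and it also returns NA for a row that contains no digits instead of calling min on an empty vector.

```r
df <- data.frame(
  ID = c("001", "002", "003", "004"),
  concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                   "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                   "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                   "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"),
  stringsAsFactors = FALSE
)

# Pull out every run of digits from each string (a list of character vectors).
matches <- regmatches(df$concatenated, gregexpr("[0-9]+", df$concatenated))

# Smallest integer per row; NA when the row has no digits at all.
df$smallest_number <- vapply(
  matches,
  function(x) if (length(x) > 0) min(as.integer(x)) else NA_integer_,
  integer(1)
)
```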