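I am trying to filter IDs based on a specified condition. For example, I want to filter IDs that show a particular difference between their pre-treatment and post-treatment questionnaire scores. The aim is to identify the IDs whose scores improved, stayed the same, or worsened. Here is a mock dataset illustrating what I am trying to achieve:-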
ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
df<-cbind(ID,Condition,Score)
df<-as.data.frame(df)
df$Condition<-as.factor(df$Condition)
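The main problem here is that some IDs appear more than once, in both the Pre and the Post condition.
I tried `dplyr` solutions: selecting the appropriate columns from the main data frame and then using the `tidyverse` `spread` function to convert to wide format, since there I can easily calculate the difference. However, I ran into a particular problem. This does not work because there are duplicated instances where an ID occurs again in the data (e.g. IDs aaa, bbb and ccc).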
df2<-df%>%
group_by(ID)%>%
spread(Condition, Score)
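This gives me the following error message:-
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 12 rows:
* 10, 22
* 11, 23
* 12, 24
* 1, 19
* 2, 20
* 3, 21
Do you need to create unique ID with tibble::rowid_to_column()?
Ideally, the result I want would look like this:-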
#improved
ID Pre Post Difference
aaa 23 17 -6
bbb 20 17 -3
ggg 20 14 -6
hhh 19 15 -4
iii 18 10 -8
aaa 23 20 -3
bbb 23 18 -5
ccc 21 11 -10
#no improvement
ID Pre Post Difference
ccc 19 19 0
eee 22 22 0
fff 22 22 0
#worsened
ID Pre Post Difference
ddd 15 20 +5
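Or something along those lines, as long as it allows me to include the duplicated IDs. Ideally, I would also like to be able to filter further based on the size of the difference. For example, I might want to subset the IDs whose scores improved by more than 5 points, or the IDs whose scores worsened by more than 5 points. Bear in mind that my actual dataset will have many more IDs than the example I have just made up and provided. As always, any help would be greatly appreciated.
Thanks in advance :)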
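Answer 0 (score: 2)
One option is to first convert 'Score' from `factor` to `numeric`, then, grouped by 'ID' and 'Condition', create a sequence column ('rn'), `spread` to 'wide' format, take the difference between the 'Post' and 'Pre' scores, and `split` by the `sign` of the 'Difference' column to create a `list` of `tibble`s. (Note: it is better not to build the data with `as.data.frame(cbind(...))`, because `cbind` converts its arguments to a `matrix`.)
library(tidyverse)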
df %>%
mutate(Score = as.numeric(as.character(Score))) %>%
group_by(ID, Condition) %>%
mutate(rn = row_number()) %>%
spread(Condition, Score) %>%
mutate(Difference = Post -Pre) %>%
ungroup %>%
select(-rn) %>%
group_split(grp = sign(Difference), keep = FALSE)
#[[1]]
# A tibble: 8 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 aaa 17 23 -6
#2 aaa 20 23 -3
#3 bbb 17 20 -3
#4 bbb 18 23 -5
#5 ccc 11 21 -10
#6 ggg 14 20 -6
#7 hhh 15 19 -4
#8 iii 10 18 -8
#[[2]]
# A tibble: 3 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ccc 19 19 0
#2 eee 22 22 0
#3 fff 22 22 0
#[[3]]
# A tibble: 1 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ddd 20 15 5
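Because `cbind` returns a `matrix`, and a `matrix` can hold only a single class, a single character column causes all the other columns to be converted to `character`. Wrapping the result with `as.data.frame` (whose default was `stringsAsFactors = TRUE` before R 4.0.0) then turns those columns into factors.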
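The coercion that `cbind` performs can be seen in a minimal base R sketch. The two-row vectors and the object names `m`, `df_ok` and `fixed` are made up for illustration; they are not part of the original data.

```r
# Hypothetical two-row example, not the question's data
ID <- c("aaa", "bbb")
Score <- c(23, 20)

# cbind() builds a matrix, and a matrix holds a single type,
# so the numeric scores are coerced to character
m <- cbind(ID, Score)
class(m[, "Score"])             # "character"

# data.frame() keeps each column's own type
df_ok <- data.frame(ID, Score)
class(df_ok$Score)              # "numeric"

# If the data already went through as.data.frame(cbind(...)), Score is a
# factor (before R 4.0.0) or a character vector (R 4.0.0 and later).
# Converting through as.character() first is safe in both cases.
df_bad <- as.data.frame(m)
fixed <- as.numeric(as.character(df_bad$Score))
```

Building the data frame with `data.frame(ID, Condition, Score)` in the first place avoids the conversion step entirely.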
Answer 1 (score: 2)
Another `tidyverse` possibility could be:
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
arrange(Difference)
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc.1 -10 11 21
2 iii -8 10 18
3 aaa -6 17 23
4 ggg -6 14 20
5 bbb.1 -5 18 23
6 hhh -4 15 19
7 aaa.1 -3 20 23
8 bbb -3 17 20
9 ccc 0 19 19
10 eee 0 22 22
11 fff 0 22 22
12 ddd 5 20 15
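Here, it first creates unique IDs. Second, it calculates the difference. Finally, it converts the data to wide format and arranges the rows by the difference.
If for some reason you need to split the result by the difference, you can add the last line of @akrun's code: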
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
group_split(sign(Difference), keep = FALSE)
[[1]]
# A tibble: 8 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 aaa -6 17 23
2 aaa.1 -3 20 23
3 bbb -3 17 20
4 bbb.1 -5 18 23
5 ccc.1 -10 11 21
6 ggg -6 14 20
7 hhh -4 15 19
8 iii -8 10 18
[[2]]
# A tibble: 3 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc 0 19 19
2 eee 0 22 22
3 fff 0 22 22
[[3]]
# A tibble: 1 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ddd 5 20 15
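The question also asks about filtering by the size of the difference, which neither pipeline above shows. The following is a sketch of one way to do it, not code from either answer. It assumes the data is built with `tibble()` so that `Score` stays numeric, and it uses `pivot_wider()`, the newer `tidyr` replacement for `spread`. The names `wide`, `big_change` and `big_drop` are made up for illustration.

```r
library(dplyr)
library(tidyr)

ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c(rep("Pre", 9), rep("Post", 9), rep("Pre", 3), rep("Post", 3))
Score <- c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,
           23,23,21,20,18,11)

wide <- tibble(ID, Condition, Score) %>%
  group_by(ID, Condition) %>%
  mutate(rn = row_number()) %>%   # occurrence index within each ID/Condition pair
  ungroup() %>%
  pivot_wider(names_from = Condition, values_from = Score) %>%
  mutate(Difference = Post - Pre)

# Rows whose score changed by more than 5 points in either direction
big_change <- wide %>% filter(abs(Difference) > 5)

# Rows whose score dropped by more than 5 points
big_drop <- wide %>% filter(Difference < -5)
```

With this mock data both filters keep the same four rows (aaa, ccc, ggg and iii), because the only increase (ddd, +5) does not exceed 5 points.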
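Answer 2 (score: 1)
The other answers address the fact that `Score` is a factor because of the call to `cbind()`. Here are solutions in base R, data.table, and dplyr.
All of the solutions handle the duplicated `ID` values by adding an additional `Group` variable. This allows the `spread` to succeed.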
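To make the role of `Group` concrete, here is a small base R sketch on a made-up six-row example (not the question's data). It numbers the rows within each condition, so a repeated ID gets a different `Group` for each occurrence.

```r
# Hypothetical data: "aaa" is measured twice in each condition
ID        <- c("aaa", "bbb", "aaa", "bbb", "aaa", "aaa")
Condition <- c("Pre", "Pre", "Post", "Post", "Pre", "Post")

# Running counter within each Condition: 1, 2, 3, ... per condition
Group <- ave(seq_along(Condition), Condition, FUN = seq_along)

# ID + Condition alone is not a unique key (returns the index of a duplicate)
anyDuplicated(data.frame(ID, Condition))

# ID + Condition + Group is unique (returns 0), so reshaping can succeed
anyDuplicated(data.frame(ID, Condition, Group))
```

Note that this pairing relies on row order: the n-th "Pre" row is matched with the n-th "Post" row, so it only gives meaningful differences when both conditions list the IDs in the same order, as the mock data does.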
# Base R ------------------------------------------------------------------
df <- data.frame(ID, Condition, Score)
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)
df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
df_wide[order(df_wide$Difference),]
# data.table --------------------------------------------------------------
library(data.table)
dt <- data.table(ID, Condition, Score)
dt[, Group := seq_len(.N), by = Condition]
dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
dt_wide[, Difference := Post - Pre]
dt_wide[order(Difference),]
# dplyr -------------------------------------------------------------------
library(tidyverse)
tib <- tibble(ID, Condition, Score)
tib%>%
group_by(Condition)%>%
mutate(Group = row_number())%>%
ungroup()%>%
spread(key = 'Condition', value = 'Score')%>%
mutate(Difference = Post - Pre)%>%
arrange(Difference)
For this very small dataset, base R is the fastest and data.table is the slowest.
Unit: milliseconds
expr min lq mean median uq max neval
base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810 6.0653 100
data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138 100
dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395 9.5594 100
And the data:
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))