I have a somewhat messy data frame in which subjects are ranked, but some subjects are tied across several ranks.
subj<-c("A","B","C,D,E","C,D,E","C,D,E","F","G,H","G,H","I")
rank<-c(1,2,3,4,5,6,7,8,9)
df<-data.frame(rank,subj)
df
rank subj
1 1 A
2 2 B
3 3 C,D,E
4 4 C,D,E
5 5 C,D,E
6 6 F
7 7 G,H
8 8 G,H
9 9 I
Where subjects are tied, I need to express their rank as the mean of the tied positions. Something like this:
n.rank n.subj
1 1.0 A
2 2.0 B
3 4.0 C
4 4.0 D
5 4.0 E
6 6.0 F
7 7.5 G
8 7.5 H
9 9.0 I
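To make the target values concrete, here is a minimal sketch of the arithmetic behind the tied rows. The variable names `tie_cde` and `tie_gh` are hypothetical and used only for illustration.

```r
# C, D and E share positions 3, 4 and 5, so each gets the mean of those positions
tie_cde <- mean(c(3, 4, 5))

# G and H share positions 7 and 8
tie_gh <- mean(c(7, 8))

tie_cde  # 4
tie_gh   # 7.5
```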
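I have tried using strsplit() and naming the list elements by rank, but the data frame I end up with seems just as hard to work with.
library(reshape2)  # melt() for lists; assumed to be the package used here
a<-strsplit(as.character(df$subj),",")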
names(a)<-df$rank
b<-melt(a)
colnames(b)<-c("n.subj","n.rank")
b[1:10,]
n.subj n.rank
1 A 1
2 B 2
3 C 3
4 D 3
5 E 3
6 C 4
7 D 4
8 E 4
9 C 5
10 D 5
I also hit a dead end when I use gregexpr() and regmatches() to try to identify which ranks need to be averaged.
m<-gregexpr(",+",df$subj)
df$no.avg<-melt(lapply(regmatches(df$subj, m),length))[,1]+1
df
rank subj no.avg
1 1 A 1
2 2 B 1
3 3 C,D,E 3
4 4 C,D,E 3
5 5 C,D,E 3
6 6 F 1
7 7 G,H 2
8 8 G,H 2
9 9 I 1
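As an aside, the same per-row count of tied subjects can probably be obtained more directly by splitting on the comma and measuring each piece. This is only a sketch, assuming `subj` is a character vector (not a factor) and that `lengths()` is available (base R 3.2.0 or later).

```r
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")

# Split each entry on literal commas, then count the pieces per entry
no.avg <- lengths(strsplit(subj, ",", fixed = TRUE))
no.avg
# 1 1 3 3 3 1 2 2 1
```

This reproduces the no.avg column above without gregexpr(), regmatches(), or melt(), though on its own it still does not give the averaged ranks.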
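Are there any creative solutions out there? Many thanks.
Answer 0 (score: 3)
Here is my attempt. I first compute the mean rank for each group of tied subjects, then split the tied subjects into separate rows.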
library(tidyverse)
options(stringsAsFactors = FALSE)
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(rank, subj)
df %>%
group_by(subj) %>%
summarise(rank = mean(rank)) %>%
rowwise() %>%
do(tibble(subj = unlist(strsplit(.$subj, ",")), rank = .$rank)) %>%
ungroup()
Output:
# A tibble: 9 × 2
subj rank
* <chr> <dbl>
1 A 1.0
2 B 2.0
3 C 4.0
4 D 4.0
5 E 4.0
6 F 6.0
7 G 7.5
8 H 7.5
9 I 9.0
Another approach:
m <- aggregate(rank~subj, data=df, mean)
# apply() coerces each row to character, so convert rank back to numeric
m <- apply(m, 1, function(x) data.frame(subj = unlist(strsplit(x[1], ",")), rank = as.numeric(x[2])))
m <- do.call(rbind, m)
rownames(m) <- NULL
m
Output:
subj rank
1 A 1.0
2 B 2.0
3 C 4.0
4 D 4.0
5 E 4.0
6 F 6.0
7 G 7.5
8 H 7.5
9 I 9.0
Answer 1 (score: 2)
A data.table version:
library(data.table)  # written against version 1.9.8
setDT(df)
df[, .(subj=unlist(strsplit(subj[1], ",")), rank=mean(rank)), by=subj][,-1]
# subj rank
#1: A 1.0
#2: B 2.0
#3: C 4.0
#4: D 4.0
#5: E 4.0
#6: F 6.0
#7: G 7.5
#8: H 7.5
#9: I 9.0
Answer 2 (score: 2)
My version with splitstackshape and aggregate. The logic is the same: split the strings on commas, then take the mean by subj.
library(splitstackshape)
aggregate(rank~subj, cSplit(df, "subj", ",", "long"), mean)
# subj rank
#1 A 1.0
#2 B 2.0
#3 C 4.0
#4 D 4.0
#5 E 4.0
#6 F 6.0
#7 G 7.5
#8 H 7.5
#9 I 9.0
where
cSplit(df, "subj", ",", "long")
gives
# rank subj
# 1: 1 A
# 2: 2 B
# 3: 3 C
# 4: 3 D
# 5: 3 E
# 6: 4 C
# 7: 4 D
# 8: 4 E
# 9: 5 C
#10: 5 D
#11: 5 E
#12: 6 F
#13: 7 G
#14: 7 H
#15: 8 G
#16: 8 H
#17: 9 I
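The split-then-average logic can also be sketched in base R alone, which may help show why it works: in the long table each subject appears once for every position in its tie, so the per-subject mean is exactly the mean of the tied positions. This is an illustrative sketch, not the answer's own method; the names `parts`, `long` and `res` are hypothetical, and `subj` is assumed to be a character vector.

```r
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Split each entry on literal commas
parts <- strsplit(subj, ",", fixed = TRUE)

# Long form: repeat each rank once per subject found in that row
long <- data.frame(rank = rep(rank, lengths(parts)),
                   subj = unlist(parts),
                   stringsAsFactors = FALSE)

# Average the ranks per individual subject
res <- aggregate(rank ~ subj, data = long, FUN = mean)
res
```

Here `long` has the same 17 rows that cSplit() produces, and `res` holds one averaged rank per subject.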
Answer 3 (score: 0)
Here is another option using tidyverse. The dataset is converted to 'long' format by splitting the 'subj' column with separate_rows; then, grouped by 'subj', we take the mean of 'rank'.
library(tidyverse)
separate_rows(df, subj) %>%
group_by(subj) %>%
summarise(rank = mean(rank))
# A tibble: 9 × 2
# subj rank
# <chr> <dbl>
#1 A 1.0
#2 B 2.0
#3 C 4.0
#4 D 4.0
#5 E 4.0
#6 F 6.0
#7 G 7.5
#8 H 7.5
#9 I 9.0