Replace subsets of vector values with the subset mean

Date: 2017-02-14 03:50:25

Tags: r regex subset average

I have a somewhat messy data frame in which subjects are ranked, but some of them are tied in rank.

    subj<-c("A","B","C,D,E","C,D,E","C,D,E","F","G,H","G,H","I")
    rank<-c(1,2,3,4,5,6,7,8,9)
    df<-data.frame(rank,subj)
    df
       rank  subj
    1    1     A
    2    2     B
    3    3 C,D,E
    4    4 C,D,E
    5    5 C,D,E
    6    6     F
    7    7   G,H
    8    8   G,H
    9    9     I   

Where subjects are tied, I need to express their rank as the mean of the tied positions. Something like this:

      n.rank n.subj
    1    1.0      A
    2    2.0      B
    3    4.0      C
    4    4.0      D
    5    4.0      E
    6    6.0      F
    7    7.5      G
    8    7.5      H
    9    9.0      I
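
To make the arithmetic concrete: C, D and E share positions 3, 4 and 5, so each should get (3 + 4 + 5) / 3 = 4, while G and H share positions 7 and 8, giving 7.5. A minimal base R sketch of just that averaging step (it does not yet split the subjects into separate rows) could use `ave()`, whose default function is the group mean. The name `avg` is only illustrative.

```r
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")

# ave() returns, for every row, the mean of rank within that row's subj group
avg <- ave(rank, subj)
avg
```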

I have tried using strsplit() and naming the list elements by rank, but the data frame I end up with seems just as hard to work with.

    library(reshape2)  # assumption: melt() comes from reshape2
    a<-strsplit(as.character(df$subj),",")
    names(a)<-df$rank
    b<-melt(a)
    colnames(b)<-c("n.subj","n.rank")
    b[1:10,]
       n.subj n.rank
    1       A      1
    2       B      2
    3       C      3
    4       D      3
    5       E      3
    6       C      4
    7       D      4
    8       E      4
    9       C      5
    10      D      5

I also hit a dead end when I use gregexpr() and regmatches() to try to identify the ranks that need averaging.

    m<-gregexpr(",+",df$subj)
    df$no.avg<-melt(lapply(regmatches(df$subj, m),length))[,1]+1
    df
     rank  subj no.avg
  1    1     A      1
  2    2     B      1
  3    3 C,D,E      3
  4    4 C,D,E      3
  5    5 C,D,E      3
  6    6     F      1
  7    7   G,H      2
  8    8   G,H      2
  9    9     I      1
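
As an aside, the same per-row count of tied subjects can probably be obtained without regular expressions. A small sketch, assuming the subjects are always separated by single commas (`n_tied` is a name made up for illustration):

```r
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")

# Split each entry on literal commas and count the pieces per row
n_tied <- lengths(strsplit(subj, ",", fixed = TRUE))
n_tied
```

This only counts how many subjects share each row. It does not compute the averaged rank.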

Are there any creative solutions out there? Many thanks.

4 answers:

Answer 0: (score: 3)

Here is my attempt. I first compute the mean rank for each group of tied subjects, then split each group into one row per subject.

library(tidyverse)
options(stringsAsFactors = FALSE)
subj <- c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I")
rank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(rank, subj)

df %>% 
  group_by(subj) %>% 
  summarise(rank = mean(rank)) %>% 
  rowwise() %>% 
  do(tibble(subj = unlist(strsplit(.$subj, ",")), rank = .$rank)) %>% 
  ungroup()

Output:

# A tibble: 9 × 2
   subj  rank
* <chr> <dbl>
1     A   1.0
2     B   2.0
3     C   4.0
4     D   4.0
5     E   4.0
6     F   6.0
7     G   7.5
8     H   7.5
9     I   9.0

Another approach:

m <- aggregate(rank~subj, data=df, mean)
# apply() passes each row as a character vector, so convert rank back to numeric
m <- apply(m, 1, function(x) data.frame(subj = unlist(strsplit(x[1], ",")), rank = as.numeric(x[2])))
m <- do.call(rbind, m)
rownames(m) <- NULL
m

Output:

  subj rank
1    A  1.0
2    B  2.0
3    C  4.0
4    D  4.0
5    E  4.0
6    F  6.0
7    G  7.5
8    H  7.5
9    I  9.0
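
The "average first, then split" idea can also be sketched in base R without looping over rows, by repeating each group mean once per subject in that group. This is only a sketch, assuming subjects are separated by single commas; `parts` and `res` are illustrative names.

```r
df <- data.frame(
  rank = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  subj = c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I"),
  stringsAsFactors = FALSE
)

# Step 1: one mean rank per group of tied subjects
m <- aggregate(rank ~ subj, data = df, FUN = mean)

# Step 2: split each group label and repeat its mean for every member
parts <- strsplit(as.character(m$subj), ",", fixed = TRUE)
res <- data.frame(
  subj = unlist(parts),
  rank = rep(m$rank, lengths(parts)),
  stringsAsFactors = FALSE
)
res
```

Because `rep()` works on the numeric column directly, `rank` stays numeric throughout.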

Answer 1: (score: 2)

A data.table version:

library(data.table) # version 1.9.8
setDT(df)
df[, .(subj=unlist(strsplit(subj[1], ",")), rank=mean(rank)), by=subj][,-1]

#   subj rank
#1:    A  1.0
#2:    B  2.0
#3:    C  4.0
#4:    D  4.0
#5:    E  4.0
#6:    F  6.0
#7:    G  7.5
#8:    H  7.5
#9:    I  9.0

Answer 2: (score: 2)

My version, using splitstackshape and aggregate. The logic is the same: we split the strings on commas, then take the mean by subj.

library(splitstackshape)
aggregate(rank~subj, cSplit(df, "subj", ",", "long"), mean)

#  subj rank
#1    A  1.0
#2    B  2.0
#3    C  4.0
#4    D  4.0
#5    E  4.0
#6    F  6.0
#7    G  7.5
#8    H  7.5
#9    I  9.0

where

cSplit(df, "subj", ",", "long")

gives

#     rank subj
# 1:    1    A
# 2:    2    B
# 3:    3    C
# 4:    3    D
# 5:    3    E
# 6:    4    C
# 7:    4    D
# 8:    4    E
# 9:    5    C
#10:    5    D
#11:    5    E
#12:    6    F
#13:    7    G
#14:    7    H
#15:    8    G
#16:    8    H
#17:    9    I
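
For comparison, the same split-then-average logic can be sketched in base R alone. The long table is built by repeating each rank once per subject in its row, and tapply() then averages by subject. The names `long` and `avg_by_subj` are chosen for this sketch only.

```r
df <- data.frame(
  rank = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  subj = c("A", "B", "C,D,E", "C,D,E", "C,D,E", "F", "G,H", "G,H", "I"),
  stringsAsFactors = FALSE
)

# Long format: one row per (rank, subject) pair
parts <- strsplit(df$subj, ",", fixed = TRUE)
long <- data.frame(
  rank = rep(df$rank, lengths(parts)),
  subj = unlist(parts),
  stringsAsFactors = FALSE
)

# Mean rank per individual subject, returned as a named vector
avg_by_subj <- tapply(long$rank, long$subj, mean)
avg_by_subj
```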

Answer 3: (score: 0)

Here is another option using tidyverse. The dataset is converted to "long" format by splitting the 'subj' column with separate_rows; then, grouped by 'subj', we get the mean of 'rank'.

library(tidyverse)
separate_rows(df, subj) %>% 
         group_by(subj) %>%
         summarise(rank = mean(rank))
# A tibble: 9 × 2
#    subj  rank
#   <chr> <dbl>
#1     A   1.0
#2     B   2.0
#3     C   4.0
#4     D   4.0
#5     E   4.0
#6     F   6.0
#7     G   7.5
#8     H   7.5
#9     I   9.0