How can I sort multiple columns within groups based on a per-group character vector in R?

Asked: 2019-03-09 01:10:47

Tags: r sorting vector dplyr

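I'm working with election data in which each election lists its candidates according to a randomly generated alphabet, and I'm trying to recreate that candidate ordering. For election 1 the alphabet might be "B, C, A", so the candidates in that election would appear in the order "Bryant, Carlson, Anderson"; election 2, with the alphabet "C, A, B", would list its candidates as "Chavez, Armstrong, Brown"; and so on.

My plan is to split each name into one column per letter, then sort on those columns using that election's alphabet. Here is a minimal df: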
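To make the intended ordering concrete, here is a minimal base R sketch of the rule for a single election, using only the first letter of each name. The vectors `alphabet` and `candidates` are hypothetical toy data taken from the example in the paragraph above, not from the real dataset.

```r
# Toy data (hypothetical), mirroring election 1 in the example above
alphabet   <- c("B", "C", "A")
candidates <- c("Anderson", "Bryant", "Carlson")

# match() gives each first letter's position in the drawn alphabet;
# order() then sorts the candidates by that position
first_letter <- substr(candidates, 1, 1)
ballot_order <- candidates[order(match(first_letter, alphabet))]

print(ballot_order)
# "Bryant" "Carlson" "Anderson"
```

The same lookup with the alphabet "C, A, B" applied to Armstrong, Brown and Chavez gives Chavez, Armstrong, Brown.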

library(dplyr)

set.seed(124)

a2000 <- sample(letters[1:9], replace = FALSE)
a2001 <- sample(letters[1:9], replace = FALSE)
a2002 <- sample(letters[1:9], replace = FALSE)


d <- data.frame(
  electionid = rep(1:3, each = 3),
  year = rep(2000:2002, each = 3),
  last1 = sample(letters[1:9], replace = TRUE),
  last2 = sample(letters[1:9], replace = TRUE),
  alph = I(list(a2000,a2000,a2000,a2001,a2001,a2001,a2002,a2002,a2002)))

last_cols <- c("last1", "last2") # in real dataset, many more columns here

newd <- d %>% 
  group_by(electionid) %>% # order columns within election, not within dataframe
  arrange_at(last_cols, funs(factor(., levels = alph[[1]]))) %>% # set new alphabet for each variable
  mutate(race_order = seq(n())) %>% # create ordering variable
  arrange(electionid) # rearrange for easy checking

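The key problem is that each alphabet is repeated for every row of its electionid, so trying to turn each column (last1, last2) into a factor using just the alph variable fails with an evaluation error (e.g. factor level [2] is duplicated). When I use options such as alph[[1]] (as in the example), unique(alph), first(alph) or coalesce(alph), I get rid of that error, but the code then seems either to use the first alphabet that appears in the dataset (a2000) or not to sort at all.

What my arrange_at attempt produces (via dput):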
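One possible way around the factor error is to skip factor() altogether and look up each row's letter in that row's own alphabet. The sketch below assumes a list column laid out like alph above (one alphabet per row); the data frame `toy` and the column `pos1` are hypothetical names used only for illustration. As far as I can tell, arrange() evaluates its expressions on the whole data frame rather than per group, which would explain why alph[[1]] always picks a2000.

```r
# Toy data (hypothetical): two elections, each row carrying its election's alphabet
alph_1 <- c("b", "c", "a")
alph_2 <- c("c", "a", "b")

toy <- data.frame(
  electionid = rep(1:2, each = 3),
  last1 = c("a", "b", "c", "a", "b", "c"),
  alph = I(list(alph_1, alph_1, alph_1, alph_2, alph_2, alph_2)),
  stringsAsFactors = FALSE
)

# Position of each row's letter within that row's own alphabet
toy$pos1 <- vapply(
  seq_len(nrow(toy)),
  function(i) match(toy$last1[i], toy$alph[[i]]),
  integer(1)
)

print(toy$pos1)
# 3 1 2 2 3 1
```

Because pos1 is an ordinary integer column, it can be sorted on without any factor levels being involved.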

structure(list(electionid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L), year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L), last1 = structure(c(1L, 1L, 4L, 3L, 2L, 6L, 3L, 
5L, 5L), .Label = c("a", "d", "f", "g", "h", "i"), class = "factor"), 
    last2 = structure(c(1L, 6L, 6L, 5L, 2L, 4L, 3L, 3L, 4L), .Label = c("c", 
    "d", "e", "g", "h", "i"), class = "factor"), alph = structure(list(
        c("b", "e", "g", "d", "c", "f", "h", "a", "i"), c("b", 
        "e", "g", "d", "c", "f", "h", "a", "i"), c("b", "e", 
        "g", "d", "c", "f", "h", "a", "i"), c("a", "h", "g", 
        "d", "b", "e", "f", "i", "c"), c("a", "h", "g", "d", 
        "b", "e", "f", "i", "c"), c("a", "h", "g", "d", "b", 
        "e", "f", "i", "c"), c("a", "g", "h", "e", "d", "b", 
        "f", "i", "c"), c("a", "g", "h", "e", "d", "b", "f", 
        "i", "c"), c("a", "g", "h", "e", "d", "b", "f", "i", 
        "c")), class = "AsIs"), race_order = c(1L, 2L, 3L, 1L, 
    2L, 3L, 1L, 2L, 3L)), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -9L), vars = "electionid", indices = list(
    0:2, 3:5, 6:8), drop = TRUE, group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    electionid = 1:3), class = "data.frame", row.names = c(NA, 
-3L), vars = "electionid", indices = list(0:2, 3:5, 6:8), drop = TRUE, group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L))

Which prints as:

# A tibble: 9 x 6
# Groups:   electionid [3]
  electionid  year last1 last2 alph      race_order
       <int> <int> <fct> <fct> <I(list)>      <int>
1          1  2000 c     b     <chr [9]>          1 # This is with alphabet a2000
2          1  2000 h     i     <chr [9]>          2 # This is with alphabet a2000
3          1  2000 d     b     <chr [9]>          3 # This is with alphabet a2000
4          2  2001 h     c     <chr [9]>          1 # This is with alphabet a2001
5          2  2001 c     b     <chr [9]>          2 # This is with alphabet a2001
6          2  2001 d     h     <chr [9]>          3 # This is with alphabet a2001
7          3  2002 h     g     <chr [9]>          1 # This is with alphabet a2002
8          3  2002 a     c     <chr [9]>          2 # This is with alphabet a2002
9          3  2002 i     d     <chr [9]>          3 # This is with alphabet a2002

What I want to get:

# A tibble: 9 x 6
# Groups:   electionid [3]
  electionid  year last1 last2 alph      race_order
       <int> <int> <fct> <fct> <I(list)>      <int>
1          1  2000 d     b     <chr [9]>          1 # Previously ranked 3
2          1  2000 h     i     <chr [9]>          2
3          1  2000 c     b     <chr [9]>          3 # Previously ranked 1
4          2  2001 c     b     <chr [9]>          1 # Previously ranked 2
5          2  2001 h     c     <chr [9]>          2 # Previously ranked 1
6          2  2001 d     h     <chr [9]>          3
7          3  2002 h     g     <chr [9]>          1 
8          3  2002 a     c     <chr [9]>          2
9          3  2002 i     d     <chr [9]>          3

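In other words, I want to generate a counter variable (race_order) that recreates the position in which each candidate was listed, by applying that election's own alphabet to the candidates' names.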
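A base R sketch of one way this could be approached, under the assumption that every row of an election carries the same alphabet in alph: convert each letter column into its position in that election's alphabet, sort by electionid and then by those positions, and number the rows within each election. The names `keys` and `idx` are introduced here for illustration, and the random draws are not expected to reproduce the tables shown above.

```r
set.seed(124)  # draws need not match the tables in the question

a2000 <- sample(letters[1:9])
a2001 <- sample(letters[1:9])
a2002 <- sample(letters[1:9])

d <- data.frame(
  electionid = rep(1:3, each = 3),
  year = rep(2000:2002, each = 3),
  last1 = sample(letters[1:9], replace = TRUE),
  last2 = sample(letters[1:9], replace = TRUE),
  alph = I(list(a2000, a2000, a2000, a2001, a2001, a2001, a2002, a2002, a2002)),
  stringsAsFactors = FALSE
)
last_cols <- c("last1", "last2")

# One integer key per letter column: the letter's position in the
# alphabet of the election that the row belongs to
keys <- lapply(d[last_cols], function(col) {
  pos <- integer(nrow(d))
  for (idx in split(seq_len(nrow(d)), d$electionid)) {
    pos[idx] <- match(as.character(col[idx]), d$alph[[idx[1]]])
  }
  pos
})

# Sort by election first, then by the keys in column order (ties fall through)
newd <- d[do.call(order, c(list(d$electionid), unname(keys))), ]
rownames(newd) <- NULL

# Number the candidates within each election
newd$race_order <- ave(seq_len(nrow(newd)), newd$electionid, FUN = seq_along)
```

Since last_cols drives the key construction, adding more letter columns should only require extending that vector.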

0 answers:

No answers