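Averaging columns with nearly identical names

Time: 2021-06-23 16:12:26

Tags: r

I have a data frame that contains only one row and named columns. The data frame looks something like this: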

  poms_tat1 poms_tat2 poms_tat3      tens1      tens2      tens3 ...
1 0.3708821 0.4915922 0.3958195 -0.1139606 -0.1462545 -0.4411494 ...

I need to compute the mean of all columns that share a similar name. The result should look like this:


   poms_tat       tens ...
1 0.4194313 -0.2337882 ...

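My first approach was to use a for loop with a nested while loop to find the indices of the relevant columns and then average those columns, but unfortunately I could not get it to work.

I also found this Stack Overflow post, which looked promising, but the agrep function seems to match columns in my data frame that should not be matched. I could not solve this with the max.distance argument. For example, it matches "threat1-3" with "reat1-3". I know these variable names are terrible, but unfortunately they are what I have to work with. To complicate things further, the number of columns per category is not always 3.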
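A minimal sketch of why tuning max.distance does not help here, using column names from the question. agrepl() searches for the pattern inside each string, so "reat" is found within "threat1" even when no edits are allowed. Stripping the numeric suffix and comparing whole names avoids this. The variable names (cols, fuzzy, prefix, exact) are illustrative only.

```r
# Column names taken from the question.
cols <- c("threat1", "threat2", "reat1", "reat2")

# Approximate matching looks for "reat" anywhere inside each name,
# so the threat columns match as well, even with max.distance = 0.
fuzzy <- agrepl("reat", cols, max.distance = 0)
print(fuzzy)

# Removing the trailing digits and comparing the whole prefix
# keeps "threat" and "reat" apart.
prefix <- sub("\\d+$", "", cols)
exact <- prefix == "reat"
print(exact)
```

Exact comparison on the stripped prefix is the idea the answers below build on.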

I hope I have described my problem clearly. Thanks.

Edit: here is a reproducible example of the data:

structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328, 
    poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638, 
    tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1, 
    isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949, 
    isi4 = 0.143940321201716, sleepqual = 0.378902118888194, 
    se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254, 
    challenge1 = 0.417855447018672, challenge2 = 0.393610946830482, 
    challenge3 = 0.417855447018672, threat1 = -0.13014390184863, 
    threat2 = -0.34027852368936, threat3 = -0.269679944985297, 
    reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091, 
    reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761, 
    selfman2 = 0.40609787800914, selfman3 = 0.418121005003545, 
    selfman4 = 0.467099366496914, selfman5 = 0.205356548067582, 
    selfman6 = 0.464385939554693, selfman7 = 0.379071252751718, 
    eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")

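4 answers:

Answer 0 (score: 4)

We can use split.default to split the columns into a list based on a substring of the column names, then loop over that list with sapply and take the rowMeans, all in base R: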

sapply(split.default(df1, sub("\\d+$", "", names(df1))), rowMeans, na.rm = TRUE)
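To make the steps visible, here is a sketch of the same idea on a small hypothetical data frame with two rows. The column names a1, a2, b1, b2, b3 and solo are made up for illustration and are not part of the question's data.

```r
# Hypothetical two-row data frame; all names and values are illustrative.
df1 <- data.frame(a1 = c(1, 3), a2 = c(2, 5),
                  b1 = c(10, 20), b2 = c(20, 40), b3 = c(30, 60),
                  solo = c(7, 8))

# Grouping key: each column name with any trailing digits removed.
key <- sub("\\d+$", "", names(df1))

# split.default() splits the columns (not the rows) into one data frame per key.
groups <- split.default(df1, key)

# rowMeans() averages across the columns of each group, row by row.
res <- sapply(groups, rowMeans, na.rm = TRUE)
print(res)
```

With more than one row the result is a matrix with one column per group. A column without a numeric suffix, such as solo, forms a group of its own.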

Answer 1 (score: 3)

You can do this with tidyr::pivot_longer, dplyr::mutate, stringr::str_remove, dplyr::group_by, and dplyr::summarise. It would be done like this:

ex_data <- structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328, 
               poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638, 
               tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1, 
               isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949, 
               isi4 = 0.143940321201716, sleepqual = 0.378902118888194, 
               se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254, 
               challenge1 = 0.417855447018672, challenge2 = 0.393610946830482, 
               challenge3 = 0.417855447018672, threat1 = -0.13014390184863, 
               threat2 = -0.34027852368936, threat3 = -0.269679944985297, 
               reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091, 
               reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761, 
               selfman2 = 0.40609787800914, selfman3 = 0.418121005003545, 
               selfman4 = 0.467099366496914, selfman5 = 0.205356548067582, 
               selfman6 = 0.464385939554693, selfman7 = 0.379071252751718, 
               eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
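library(magrittr)  # provides the %>% pipe used below

ex_data %>% 
    tidyr::pivot_longer(everything()) %>% 
    dplyr::mutate(
        # Strip the whole numeric suffix, so that a name such as x10 groups with x1.
        name = stringr::str_remove(name, '[0-9]+$')
    ) %>% 
    dplyr::group_by(name) %>% 
    dplyr::summarise(
        mean = mean(value)
    )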
# A tibble: 11 x 2
   name        mean
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247

Alternatively, you can combine split.default, stringr::str_remove, purrr::map, unlist, purrr::map_dbl, and tibble::enframe, as follows:

ex_data %>% 
    # '[0-9]+$' strips the whole numeric suffix, not just its last digit.
    split.default(stringr::str_remove(names(.), '[0-9]+$')) %>% 
    purrr::map(unlist) %>% 
    purrr::map_dbl(mean) %>% 
    tibble::enframe()
# A tibble: 11 x 2
   name       value
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247
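Both variants in this answer group columns by removing the numeric suffix from each name, so the pattern matters once a category has ten or more columns. The sketch below compares a single-digit pattern with a multi-digit one, using base sub() in place of stringr::str_remove to stay dependency-free. The item names are hypothetical; sleepqual comes from the question.

```r
# Hypothetical names: a group with more than nine members, plus a name with no suffix.
nms <- c("item1", "item9", "item10", "item12", "sleepqual")

# '[0-9]$' removes a single trailing digit only, so item10 and item12 become "item1".
one_digit <- sub("[0-9]$", "", nms)
print(one_digit)

# '[0-9]+$' removes the whole run of trailing digits.
all_digits <- sub("[0-9]+$", "", nms)
print(all_digits)
```

For the question's data, where no suffix exceeds 7, both patterns produce the same groups.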

Answer 2 (score: 3)

You can also use the following solution:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(everything()) %>% 
  group_by(grp = sub("\\d+$", "", name)) %>%
  summarise(Avg = mean(value, na.rm = TRUE))

   grp          Avg
   <chr>      <dbl>
 1 bat_ratio  1    
 2 challenge  0.410
 3 eli        0.172
 4 isi        0.378
 5 poms_tat   0.419
 6 reat       0.523
 7 se         0.332
 8 selfman    0.390
 9 sleepqual  0.379
10 tens      -0.234
11 threat    -0.247
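The summary above has one row per group, whereas the question asks for a single row with one column per group. A possible base R sketch of that last step, assuming the summary has columns grp and Avg as printed. Only three groups are typed in here, with values rounded as in the output. The names avg and wide are illustrative.

```r
# Three group means copied from the printed output above (rounded to three digits).
avg <- data.frame(
  grp = c("bat_ratio", "challenge", "eli"),
  Avg = c(1, 0.410, 0.172)
)

# Name each mean by its group, then turn the named values into a one-row data frame.
wide <- as.data.frame(as.list(setNames(avg$Avg, avg$grp)))
print(wide)
```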

Answer 3 (score: 3)

Here is one with tidyr. I just saw that Baraliuh's answer was accepted, so my answer here is mostly for closure.

library(tidyr)

my_summary <- as.data.frame(sapply(X = pivot_longer(data = df,
                                                    # Desired columns (all) to summarize.
                                                    cols = everything(),
                                                    # Take each group of columns, which
                                                    # share a common name before different
                                                    # numeric suffixes, and pivot them
                                                    # into multiple rows under a common
                                                    # column by that name.
                                                    names_to = c(".value",
                                                                 # Discard anything after
                                                                 # the prefix.
                                                                 NA),
                                                    # Identify the (optional) numeric
                                                    # suffix.
                                                    names_sep = "\\d*$"),
                                   # Take the mean of each column; ignore missing values.
                                   FUN = mean, na.rm = TRUE,
                                   # Keep as a list, to convert into a data.frame.
                                   simplify = FALSE))

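Compared with some of the alternatives, I do believe that doing the work inside the pivot makes the workflow cleaner, and the output lands in exactly the format you asked for: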

   poms_tat       tens bat_ratio       isi sleepqual        se challenge     threat      reat   selfman       eli
1 0.4194313 -0.2337882         1 0.3781875 0.3789021 0.3315154 0.4097739 -0.2467008 0.5229015 0.3899116 0.1722602

If dplyr is also involved, the summary (into a tibble) is even cleaner:

library(tidyr)
library(dplyr)

my_summary <- pivot_longer(data = df, cols = everything(),
                           names_to = c(".value", NA), names_sep = "\\d*$") %>%
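  # A lambda is used because passing extra arguments through across() is deprecated in dplyr >= 1.1.0.
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))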