I have a data frame that contains only a single row and has named columns. The data frame looks something like this:
poms_tat1 poms_tat2 poms_tat3 tens1 tens2 tens3 ...
1 0.3708821 0.4915922 0.3958195 -0.1139606 -0.1462545 -0.4411494 ...
I need to compute the mean of all columns that have similar names. The result should look like this:
poms_tat tens ...
1 0.4194551 -0.2337881667 ...
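My first approach was to use a for loop with a nested while loop to find the indices of the related columns and then average those columns, but unfortunately I could not get it to work.
I also found this stackoverflow post, which looked promising, but the agrep function seems to match columns in my data frame that should not be matched, and I could not fix that with the max.distance argument. For example, it matches "threat1-3" with "reat1-3". I know the variable names are terrible, but unfortunately that is what I have to work with. To complicate things further, the number of columns per category is not always 3.
I hope I have explained my problem clearly. Thank you.
Edit: Here is a reproducible data set: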
structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328,
poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638,
tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1,
isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949,
isi4 = 0.143940321201716, sleepqual = 0.378902118888194,
se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254,
challenge1 = 0.417855447018672, challenge2 = 0.393610946830482,
challenge3 = 0.417855447018672, threat1 = -0.13014390184863,
threat2 = -0.34027852368936, threat3 = -0.269679944985297,
reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091,
reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761,
selfman2 = 0.40609787800914, selfman3 = 0.418121005003545,
selfman4 = 0.467099366496914, selfman5 = 0.205356548067582,
selfman6 = 0.464385939554693, selfman7 = 0.379071252751718,
eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
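Answer 0 (score: 4)
In base R, we can use split.default to split the data frame into a list of data frames keyed by a substring of the column names (the name with its trailing digits removed), then loop over that list with sapply and take the rowMeans of each element: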
sapply(split.default(df1, sub("\\d+$", "", names(df1))), rowMeans, na.rm = TRUE)
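If a one-row data frame in the original column order is wanted, as in the question's expected output, the same idea can be sketched with lapply and as.data.frame. This is only an illustration and not part of the original answer; the small data frame below is a made-up stand-in for the real data.

```r
# Made-up stand-in data (assumption): one row, numbered column groups.
df1 <- data.frame(tens1 = 1, tens2 = 2, tens3 = 3,
                  threat1 = 4, threat2 = 6,
                  reat1 = 10, sleepqual = 5)

# Group label = column name with any trailing digits removed.
grp <- sub("\\d+$", "", names(df1))

# A factor with explicit levels keeps the groups in their original
# column order instead of sorting them alphabetically.
grp <- factor(grp, levels = unique(grp))

# lapply() returns one element per group; as.data.frame() turns that
# list into a one-row data frame with one column per group.
res <- as.data.frame(lapply(split.default(df1, grp), rowMeans, na.rm = TRUE))
res
```

Because the trailing-digit pattern is an exact match rather than a fuzzy one, threat and reat end up in separate groups.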
Answer 1 (score: 3)
You can do this with tidyr::pivot_longer, dplyr::mutate, stringr::str_remove, dplyr::group_by and dplyr::summarise.
It would be done like this:
ex_data <- structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328,
poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638,
tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1,
isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949,
isi4 = 0.143940321201716, sleepqual = 0.378902118888194,
se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254,
challenge1 = 0.417855447018672, challenge2 = 0.393610946830482,
challenge3 = 0.417855447018672, threat1 = -0.13014390184863,
threat2 = -0.34027852368936, threat3 = -0.269679944985297,
reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091,
reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761,
selfman2 = 0.40609787800914, selfman3 = 0.418121005003545,
selfman4 = 0.467099366496914, selfman5 = 0.205356548067582,
selfman6 = 0.464385939554693, selfman7 = 0.379071252751718,
eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
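library(magrittr)  # provides the %>% pipe used below
ex_data %>%
  tidyr::pivot_longer(everything()) %>%
  dplyr::mutate(
    # Strip the whole run of trailing digits, so suffixes >= 10 also work.
    name = stringr::str_remove(name, '[0-9]+$')
  ) %>%
  dplyr::group_by(name) %>%
  dplyr::summarise(
    mean = mean(value)
  )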
# A tibble: 11 x 2
name mean
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
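One detail worth checking is how many trailing digits the pattern removes. A pattern that strips only one trailing digit is enough for suffixes 1 to 9, but it would split a group as soon as a suffix reaches 10. A small base R sketch of the difference; the column name isi10 is hypothetical and does not occur in the question's data:

```r
# Hypothetical column names (assumption): "isi10" is not in the real data.
cols <- c("isi1", "isi2", "isi10", "threat1", "reat1", "sleepqual")

# Removing a single trailing digit leaves "isi1" behind for "isi10".
one_digit <- sub("[0-9]$", "", cols)

# Removing the whole run of trailing digits gives one prefix per group.
all_digits <- sub("[0-9]+$", "", cols)

unique(one_digit)   # "isi" "isi1" "threat" "reat" "sleepqual"
unique(all_digits)  # "isi" "threat" "reat" "sleepqual"
```

Either way, threat and reat stay in separate groups, since the match is anchored at the end of the name and is exact rather than approximate as with agrep.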
Alternatively, you can use split.default together with stringr::str_remove, purrr::map, unlist, purrr::map_dbl and tibble::enframe, like this:
ex_data %>%
  # Group columns by name with the run of trailing digits removed.
  split.default(stringr::str_remove(names(.), '[0-9]+$')) %>%
  purrr::map(unlist) %>%
  purrr::map_dbl(mean) %>%
  tibble::enframe()
# A tibble: 11 x 2
name value
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
Answer 2 (score: 3)
You can also use the following solution:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
group_by(grp = sub("\\d+$", "", name)) %>%
summarise(Avg = mean(value, na.rm = TRUE))
grp Avg
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
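The long result can be turned back into the single-row layout shown in the question with tidyr::pivot_wider. A minimal sketch, assuming dplyr and tidyr are installed; the small data frame is a made-up stand-in for df:

```r
library(dplyr)
library(tidyr)

# Made-up stand-in data (assumption): one row, numbered column groups.
df <- data.frame(tens1 = 1, tens2 = 2, tens3 = 3,
                 threat1 = 4, threat2 = 6,
                 sleepqual = 5)

wide <- df %>%
  pivot_longer(everything()) %>%
  group_by(grp = sub("\\d+$", "", name)) %>%
  summarise(Avg = mean(value, na.rm = TRUE)) %>%
  # One column per group, one row overall.
  pivot_wider(names_from = grp, values_from = Avg)

wide
```

The columns come out in the alphabetical order produced by group_by, not in the original column order.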
Answer 3 (score: 3)
Here is one with tidyr. I just saw that Baraliuh's answer has been accepted, so my answer here is mostly for completeness.
library(tidyr)
my_summary <- as.data.frame(sapply(X = pivot_longer(data = df,
# Desired columns (all) to summarize.
cols = everything(),
# Take each group of columns, which
# share a common name before different
# numeric suffixes, and pivot them
# into multiple rows under a common
# column by that name.
names_to = c(".value",
# Discard anything after
# the prefix.
NA),
# Identify the (optional) numeric
# suffix.
names_sep = "\\d*$"),
# Take the mean of each column; ignore missing values.
FUN = mean, na.rm = TRUE,
# Keep as a list, to convert into a data.frame.
simplify = FALSE))
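Compared with some of the alternatives, I do believe that doing everything through the pivot makes the workflow cleaner, and the output has exactly the format you asked for.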
poms_tat tens bat_ratio isi sleepqual se challenge threat reat selfman eli
1 0.4194313 -0.2337882 1 0.3781875 0.3789021 0.3315154 0.4097739 -0.2467008 0.5229015 0.3899116 0.1722602
If dplyr is also brought in, the summary (into a tibble) is cleaner still:
library(tidyr)
library(dplyr)
my_summary <- pivot_longer(data = df, cols = everything(),
names_to = c(".value", NA), names_sep = "\\d*$") %>%
  # Passing extra arguments through across() is deprecated since dplyr 1.1.0.
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))