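I have a horrible data frame in which many of the dimensions hold values that are comma-separated arrays. I would like to apply operations such as count, sum and mean to the values inside these arrays.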
e.g.
colA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]
Does anyone have an elegant way to compute the sum/count/mean of each array?
Thanks
Answer 0 (score: 0)
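Convert it to long form, after which the aggregation is easy to perform.
1) Assuming DF as shown reproducibly in the Note at the end, remove the square brackets from ColB and split ColB into rows, converting the values to numeric. Then group by ColA and take the sum and mean of ColB (other aggregate functions could be used as well). If you do not want D reported with NA, filter out the rows whose ColB does not start with [; see the filter statement in (2).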
library(dplyr)
library(tidyr)
DF %>%
  # drop the [ and ] characters
  mutate(ColB = gsub("[][]", "", ColB)) %>%
  # one value per row; non-numeric entries such as \N become NA
  separate_rows(ColB, sep = "[^-0-9.]", convert = TRUE) %>%
  group_by(ColA) %>%
  summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
  ungroup
giving:
# A tibble: 6 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 D NA NA
5 E 982. 98.2
6 F 164. 16.4
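To make the individual steps in (1) easier to follow, here is a minimal base R sketch that applies them to a single value. The string `x` is a shortened, made-up example rather than a row from DF, and `strsplit` is used here in place of `separate_rows`; it is meant as an illustration, not as a replacement for the pipeline above.

```r
# One ColB-style value, shortened for illustration (hypothetical, not from DF)
x <- "[0.0,0.0,650.2635,0.0]"

# Step 1: drop the square brackets, as the gsub() call in the pipeline does
stripped <- gsub("[][]", "", x)

# Step 2: split on commas and convert the pieces to numeric
vals <- as.numeric(strsplit(stripped, ",", fixed = TRUE)[[1]])

# Step 3: aggregate; length() gives the count the question also asks about
c(Sum = sum(vals), Mean = mean(vals), Count = length(vals))
```

In the dplyr pipeline the count could presumably be obtained the same way by adding something like `Count = n()` to the `summarize` call.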
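2) Alternatively, the ColB strings that start with [ are valid JSON. In this case we first filter out the non-JSON elements of ColB and then parse the rest with fromJSON.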
library(dplyr)
library(jsonlite)
library(tidyr)
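DF %>%
  # keep only the rows whose ColB looks like a JSON array
  filter(substring(ColB, 1, 1) == "[") %>%
  rowwise() %>%
  # parse each string into a numeric vector stored in a list column
  mutate(ColB = list(fromJSON(ColB))) %>%
  ungroup %>%
  # name the column explicitly; recent tidyr versions expect it
  unnest(ColB) %>%
  group_by(ColA) %>%
  summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
  ungroup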
giving:
# A tibble: 5 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 E 982. 98.2
5 F 164. 16.4
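If reshaping to long form is not wanted, the per-row aggregates can also be computed directly in base R. The sketch below is a different technique from the two approaches above, not part of the original answer: `parse_array` is a hypothetical helper, and the three-row DF is a shortened stand-in for the real data.

```r
# Shortened stand-in for the data frame (hypothetical values)
DF <- data.frame(
  ColA = c("A", "B", "D"),
  ColB = c("[0.0,0.0,2177.0068]", "[0.0,650.2635,0.0]", "\\N"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "[a,b,c]" into a numeric vector;
# anything that does not start with "[" is treated as missing
parse_array <- function(s) {
  if (!startsWith(s, "[")) return(NA_real_)
  as.numeric(strsplit(gsub("[][]", "", s), ",", fixed = TRUE)[[1]])
}

vals <- lapply(DF$ColB, parse_array)

# One aggregate per row, kept alongside the original columns
DF$Sum   <- vapply(vals, sum, numeric(1))
DF$Mean  <- vapply(vals, mean, numeric(1))
DF$Count <- vapply(vals, function(v) sum(!is.na(v)), integer(1))
DF
```

This keeps one row per ColA value, with NA for the sum and mean and a count of 0 where ColB is not an array.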
Note
Lines <- "ColA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \\N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)