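R - data frame containing arrays of comma-separated values

Time: 2018-07-25 15:00:11

Tags: arrays r

I have a horrible data frame in which the values of many dimensions are comma-separated arrays. I would like to apply operations such as count, sum and mean to the values inside these arrays rather than to the arrays themselves.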

e.g. 
    colA ColB
    A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
    B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    D \N
    E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]

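Does anyone have an elegant way to compute the sum/count/mean of each array?

Thanks

1 answer:

Answer 0 (score: 0)

Convert it to long form, in which case it is easy to perform the aggregation.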
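To make "long form" concrete, here is a minimal sketch on a hypothetical two-row data frame (`toy`, `id` and `arr` are illustration names, not part of the question's data). Each array element ends up on its own row, repeated alongside its identifier, which is what makes a grouped aggregation straightforward.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the question's DF): one array string per row.
toy <- data.frame(id  = c("x", "y"),
                  arr = c("[1.0,2.0,3.0]", "[4.0,5.0]"),
                  stringsAsFactors = FALSE)

long <- toy %>%
  mutate(arr = gsub("[][]", "", arr)) %>%         # strip the square brackets
  separate_rows(arr, sep = ",", convert = TRUE)   # one row per array element

long   # five rows: three for id "x", two for id "y"
```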

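1) Assuming the DF shown reproducibly in the Note at the end, remove the square brackets from ColB and split ColB into rows, converting the values to numeric. Then group by ColA and take the sum and mean of ColB (other aggregation functions can be used as well). If you do not want the D row with NA, filter out the rows whose ColB does not start with [; see the filter statement in (2).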

library(dplyr)
library(tidyr)

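DF %>%
   # strip the square brackets
   mutate(ColB = gsub("[][]", "", ColB)) %>%
   # split on anything that is not part of a number: one row per value, \N becomes NA
   separate_rows(ColB, sep = "[^-0-9.]", convert = TRUE) %>%
   group_by(ColA) %>%
   summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
   ungroup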

giving:

# A tibble: 6 x 3
  ColA    Sum  Mean
  <chr> <dbl> <dbl>
1 A     2177. 218. 
2 B      650.  65.0
3 C      406.  40.6
4 D       NA   NA  
5 E      982.  98.2
6 F      164.  16.4
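Since the question also asks for a count, the same pipeline can be extended with further aggregation functions. The following is only a sketch under some assumptions: it uses a shortened copy of the input (rows A, B and D), splits on a plain comma, borrows the filter from (2) to drop the `\N` row, and the column names `Count` and `NonZero` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Shortened copy of the input from the Note (rows A, B and D only).
Lines <- "ColA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \\N"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

res <- DF %>%
  filter(substring(ColB, 1, 1) == "[") %>%           # drop rows such as D
  mutate(ColB = gsub("[][]", "", ColB)) %>%          # strip the square brackets
  separate_rows(ColB, sep = ",", convert = TRUE) %>% # one row per array element
  group_by(ColA) %>%
  summarize(Count   = n(),             # number of elements in the array
            NonZero = sum(ColB != 0),  # number of non-zero elements
            Sum     = sum(ColB),
            Mean    = mean(ColB))

res
```

Whether zeros should be counted as values depends on what the arrays represent, which the question does not say, so both counts are shown.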

2) Alternatively, use the fact that the ColB strings starting with [ are JSON. In this case we first filter out the non-JSON elements of ColB.

library(dplyr)
library(jsonlite)
library(tidyr)

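DF %>%
   # keep only the rows whose ColB is a JSON array
   filter(substring(ColB, 1, 1) == "[") %>%
   rowwise() %>%
   # parse each JSON string into a numeric vector held in a list column
   mutate(ColB = list(fromJSON(ColB))) %>%
   ungroup %>%
   # expand the list column: one row per array element
   unnest(ColB) %>%
   group_by(ColA) %>%
   summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
   ungroup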

giving:

# A tibble: 5 x 3
  ColA    Sum  Mean
  <chr> <dbl> <dbl>
1 A     2177. 218. 
2 B      650.  65.0
3 C      406.  40.6
4 E      982.  98.2
5 F      164.  16.4
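As a small illustration of why (2) filters before parsing, the sketch below calls `fromJSON()` on a single array string and on the `\N` marker. The shortened array and the variable names `x` and `ok` are made up for this example; the `tryCatch()` wrapper is only there to show that the marker is rejected as JSON.

```r
library(jsonlite)

# A bracketed, comma-separated list of numbers is a valid JSON array,
# so fromJSON() returns it as a plain numeric vector.
x <- fromJSON("[0.0,0.0,650.2635,0.0]")
is.numeric(x)
sum(x)

# The marker \N (a backslash followed by N) is not valid JSON, so parsing
# it signals an error; here the error is caught and turned into FALSE.
ok <- tryCatch({ fromJSON("\\N"); TRUE }, error = function(e) FALSE)
ok
```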

Note

Lines <- "ColA ColB
    A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
    B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    D \\N
    E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
    F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]"

DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)