r tidyverse - computing the mean of multiple columns that share the same name

Time: 2018-08-14 21:43:50

Tags: r dplyr

I have some data that is collected weekly. A sample of it, as produced by dput, looks like this:

p <- structure(list(railroad = structure(c(2L, 2L, 2L, 3L, 3L, 3L), .Label = 
c("All Other Railroads", 
"BNSF Railway Company", "CN", "CSX Transportation", "Norfolk Southern", 
"The Kansas City Southern Railway and Kansas City Southern de Mexico, S.A. de C.V. Consolidated ", 
"Union Pacific Railroad"), class = "factor"), measure = structure(c(1L, 
4L, 3L, 1L, 4L, 3L), .Label = c("Cars On Line - By Car Owner", 
"Cars On Line - By Car Type", "Terminal Dwell (Hours)", "Train Speed (MPH)"
), class = "factor"), category = structure(c(76L, 35L, 4L, 76L, 
35L, 29L), .Label = c("All Trains", "Allentown, PA", "Baltimore, MD", 
"Barstow, CA", "Bellevue, OH", "Birmingham, AL", "Box", "Buffalo, NY", 
"Chattanooga, TN", "Chicago (Proviso), IL", "Chicago, IL", "Cincinnati, OH", 
"Coal Unit", "Columbus, OH", "Conway, PA", "Corbin, KY", "Covered Hopper", 
"Decatur, IL", "Denver, CO", "Elkhart, IN", "Entire Railroad", 
"Fond du Lac Yard, WI", "Foreign RR", "Fort Worth, TX", "Galesburg, IL", 
"Gondola", "Grain Unit", "Hamlet, NC", "Harrison Yard (Memphis), TN", 
"Hinkle, OR", "Houston (Englewood), TX", "Houston (Settegast), TX", 
"Houston, TX", "Indianapolis, IN", "Intermodal", "Jackson Yard, MS", 
"Jackson, MS", "Kansas City, KS", "Kansas City, MO", "Knoxville, TN", 
"Laredo, TX", "Lincoln, NE", "Linwood, NC", "Livonia, LA", "Louisville, KY", 
"MacMillan Yard (Toronto), ON", "Macon, GA", "Manifest", "Markham Yard, IL", 
"Memphis, TN", "Monterrey, NL", "Montgomery, AL", "Multilevel", 
"Nashville, TN", "New Orleans, LA", "North Little Rock, AR", 
"North Platte East, NE", "North Platte West, NE", "Northtown, MN", 
"Nuevo Laredo, TM", "Open Hopper", "Other", "Pasco, WA", "Pct. Private", 
"Pine Bluff, AR", "Private", "Roanoke, VA", "Roseville, CA", 
"Russell, KY", "San Luis Potosi, SL", "Sanchez, TM", "Selkirk, NY", 
"Sheffield, AL", "Shreveport, LA", "Symington Yard (Winnipeg), MB", 
"System", "Tank", "Tascherau Yard (Montreal), QC", "Thornton Yard (Vancouver), BC", 
"Toledo, OH", "Total", "Tulsa, OK", "Walker Yard (Edmonton), AB", 
"Waycross, GA", "West Colton, CA", "Willard, OH"), class = "factor"), 
`201510` = c(66923, 33.9, 39.3, 40227, 30.8, 17.5), `201510` = c(66637, 
32.6, 56.6, 40778, 30.9, 18.3), `201510` = c(66309, 33.4, 
44.9, 40407, 30.5, 17.3), `201511` = c(65980, 34.6, 37.5, 
40316, 30.6, 17.5), `201511` = c(67034, 34.6, 43.1, 40174, 
30.4, 18.7)), row.names = c(1L, 15L, 21L, 33L, 47L, 53L), class = 
"data.frame")

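There are 143 columns in total, and columns 4 to 143 are numeric. I want to compute the mean of all columns that share the same column name. In the sample above, column 201510 is repeated three times and column 201511 is repeated twice. The desired output is the row-wise mean of each repeated column. For example, 201510 would have the following values: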

`201510`
[1] 66623.00000    33.30000    46.93333 40470.66667    30.73333    17.70000
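
To make the target concrete, here is a minimal sketch that reproduces those six values by hand. The vectors `w1`, `w2` and `w3` are illustrative names (not part of the original data) holding the three `201510` columns copied from the dput output.

```r
# Values of the three `201510` columns, copied from the dput output above
w1 <- c(66923, 33.9, 39.3, 40227, 30.8, 17.5)
w2 <- c(66637, 32.6, 56.6, 40778, 30.9, 18.3)
w3 <- c(66309, 33.4, 44.9, 40407, 30.5, 17.3)

# Row-wise mean across the three same-named columns
m201510 <- rowMeans(cbind(w1, w2, w3))
print(m201510)
```

The first element, for instance, is (66923 + 66637 + 66309) / 3 = 66623.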

I tried the following code:

library(tidyverse)

p = data.frame(p)

p %>%
  gather(time,value,railroad, measure, category) %>%                       
  mutate(time = gsub('X([^.]+)|.', '\\1', time)) %>%  
  group_by(time, value, railroad, measure, category) %>%                            
  summarise(MEAN = mean(value)) %>%                   
  ungroup() %>%                                       
  spread(time, MEAN)  

which produces the following error:

`Error in grouped_df_impl(data, unname(vars), drop) : 
Column `railroad` is unknown
In addition: Warning message:
attributes are not identical across measure variables;
they will be dropped `

Is there a way to do this?

3 answers:

Answer 0: (score: 3)

First split the data frame by column name, then apply rowMeans to each sub data frame:

lapply(split.default(p[,4:length(p)], names(p)[4:length(p)]), rowMeans)
#$`201510`
#          1          15          21          33          47          53 
#66623.00000    33.30000    46.93333 40470.66667    30.73333    17.70000 

#$`201511`
#      1      15      21      33      47      53 
#66507.0    34.6    40.3 40245.0    30.5    18.1 
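
As a rough illustration of what the one-liner does, here is the same idea on a toy data frame. The data frame `d` and its column names are hypothetical. Note that the grouping vector comes from `names(d)` rather than from the subset itself, because subsetting a data frame with `[` may de-duplicate the names of the result.

```r
# Hypothetical toy data frame with a duplicated column name
d <- data.frame(id = c("x", "y"), a = c(1, 2), a = c(3, 4), b = c(5, 6),
                check.names = FALSE)

# Split the numeric columns into one sub data frame per original name
groups <- split.default(d[, 2:4], names(d)[2:4])

# Row-wise mean within each group of same-named columns
means <- lapply(groups, rowMeans)
print(means)
```

Here `groups$a` holds both `a` columns and `groups$b` holds the single `b` column, so `means$a` is the per-row average of the two `a` columns.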

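Although the above approach works, you should avoid giving different columns the same name, because many R functions will rename such columns to make every name unique (data.frame() does so by default). You would do better to rethink how you process the data, for example by reshaping the data frame so that the time period sits in a single column; you can then group by that column and summarise.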

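# assumes p = data.frame(p) from the question has been run, so the duplicated
# names have become X201510, X201510.1, ..., which the sub() call below undoes
p %>% 
    # create the row number to identify each row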
    mutate(rn = row_number()) %>% 
    # gather time columns into a single column
    gather('time', 'value', -rn, -railroad, -measure, -category) %>% 
    mutate(time = sub('X([^.]+).*', '\\1', time)) %>% 
    # group and aggregate
    group_by(rn, railroad, measure, category, time) %>% 
    summarise(value = mean(value)) %>% 
    # split value by time
    {split(.$value, .$time)}

#$`201510`
#[1] 66623.00000    33.30000    46.93333 40470.66667    30.73333    17.70000

#$`201511`
#[1] 66507.0    34.6    40.3 40245.0    30.5    18.1
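
The renaming that this answer relies on can be seen in isolation. The sketch below uses a hypothetical one-row data frame `d`. It shows the names that `data.frame()` produces with its default `check.names = TRUE`, and how the `sub()` pattern from the pipeline maps them back to the original labels.

```r
# Hypothetical one-row data frame that keeps duplicated names
d <- data.frame(`201510` = 1, `201510` = 2, `201511` = 3, check.names = FALSE)
names(d)

# Re-wrapping with data.frame() applies make.names(unique = TRUE)
renamed <- names(data.frame(d))
print(renamed)

# Strip the leading X and any .1, .2, ... suffix to recover the labels
recovered <- sub("X([^.]+).*", "\\1", renamed)
print(recovered)
```

After this step several rows of the long data share the same `time` value, which is what makes the `group_by()` and `summarise()` calls average the duplicated columns.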

Answer 1: (score: 3)

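The main problem here is that the column names are not unique. The tidyverse generally assumes unique column names, and many of its functions add suffixes to make names unique if they are not already, as do many base functions. In all the solutions below we therefore avoid any such functions. We can still use magrittr, purrr and certain base functions that leave duplicated names alone.

(1), (2) and (4) use only magrittr. (1a) uses purrr, and in (3) we use tidyr and dplyr, but only after converting to long form.

All solutions append one column for each unique name among the numeric columns, with a name of the form mean.*. In the example in the question there are two unique names among the numeric columns, so two columns are appended, named mean.201510 and mean.201511, as shown below. We show the output only for (1); the others are similar.

All solutions use two pipelines. The first consists of the first %>%; the second pipeline appears as an argument to cbind and is what creates the new columns.

(1), (1a) and (4) are the shortest.

1) magrittr magrittr itself does not seem to add suffixes. cbind the original data frame p with the following: first convert p to a list of columns, keep the numeric components, split them by column name, convert each component to a data frame, take the rowMeans of each, and finally set the names to mean.*.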

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    split(names(.)) %>%
    lapply(as.data.frame) %>%
    lapply(rowMeans) %>%
    setNames(paste0("mean.", names(.)))
  )

giving:

               railroad                     measure                    category
1  BNSF Railway Company Cars On Line - By Car Owner                      System
15 BNSF Railway Company           Train Speed (MPH)                  Intermodal
21 BNSF Railway Company      Terminal Dwell (Hours)                 Barstow, CA
33                   CN Cars On Line - By Car Owner                      System
47                   CN           Train Speed (MPH)                  Intermodal
53                   CN      Terminal Dwell (Hours) Harrison Yard (Memphis), TN
    201510  201510  201510  201511  201511 mean.201510 mean.201511
1  66923.0 66637.0 66309.0 65980.0 67034.0 66623.00000     66507.0
15    33.9    32.6    33.4    34.6    34.6    33.30000        34.6
21    39.3    56.6    44.9    37.5    43.1    46.93333        40.3
33 40227.0 40778.0 40407.0 40316.0 40174.0 40470.66667     40245.0
47    30.8    30.9    30.5    30.6    30.4    30.73333        30.5
53    17.5    18.3    17.3    17.5    18.7    17.70000        18.1
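
For readers who find the nested pipeline hard to follow, the same steps can be written with named intermediate values. This is only a sketch on a hypothetical toy data frame `d`, using base R alone; the variable names `num`, `grp`, `means` and `res` are made up for illustration.

```r
# Hypothetical toy data frame with a duplicated column name
d <- data.frame(id = c("x", "y"), a = c(1, 2), a = c(3, 4), b = c(5, 6),
                check.names = FALSE, stringsAsFactors = FALSE)

num <- Filter(is.numeric, as.list(d))   # numeric columns as a list
grp <- split(num, names(num))           # one sub-list per unique name

# Row-wise mean of each group, then rename to mean.<name>
means <- lapply(grp, function(g) rowMeans(as.data.frame(g)))
names(means) <- paste0("mean.", names(means))

# Append the new columns to the original data frame
res <- cbind(d, means)
print(res)
```

Working on the list returned by `as.list()` rather than on the data frame is the design choice that keeps the duplicated names intact until the split has been made.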

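1a) purrr Optionally, we can replace some of the base functions with their purrr or magrittr equivalents. The other solutions could be translated to purrr in the same way.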

library(magrittr)
library(purrr)

p %>%
  cbind(as.list(.) %>%
    keep(is.numeric) %>%
    split(names(.)) %>%
    map(as.data.frame) %>%
    map(rowMeans) %>%
    set_names(paste0("mean.", names(.)))
  )

2) apply/tapply Another possibility is to tapply across each row separately. apply does this row by row.

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    do.call("cbind", .) %>%
    apply(1, tapply, colnames(.), mean) %>%
    t %>%
    as.data.frame %>%
    setNames(paste0("mean.", names(.)))
  )
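
The apply/tapply combination can be checked on a small matrix. This is a sketch with a hypothetical matrix `m` whose column names repeat. `tapply` averages one row by column name, and `apply` repeats that for every row, returning the groups in rows, which is why the result is transposed.

```r
# Hypothetical numeric matrix with a duplicated column name
m <- rbind(c(1, 3, 5),
           c(2, 4, 6),
           c(10, 20, 30))
colnames(m) <- c("a", "a", "b")

# Mean of a single row, grouped by column name
tapply(m[1, ], colnames(m), mean)

# Same for every row; apply() puts the groups in rows, so transpose
by_row <- t(apply(m, 1, tapply, colnames(m), mean))
print(by_row)
```

`by_row` then has one row per original row and one column per unique name.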

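3) as.data.frame.table This approach uses dplyr and tidyr for most operations, but converts to long form with as.data.frame.table rather than gather, in order to avoid the problem of added suffixes.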

library(dplyr)
library(magrittr)
library(purrr)  # needed for keep()
library(tidyr)

p %>%
  cbind(as.list(.) %>%
    keep(is.numeric) %>%
    do.call("cbind", .) %>%
    as.data.frame.table %>%
    group_by(Var2, Var1) %>%
    summarize(Mean = mean(Freq)) %>%
    ungroup %>%
    spread(Var2, Mean) %>%
    select(-Var1) %>%
    set_names(paste0("mean.", names(.)))
  )

4) lm If X is the matrix of numeric columns and mean. holds the column names, then t(coef(lm(t(X) ~ mean. - 1))) gives the desired mean columns, so:

library(magrittr)

p %>%
  cbind(as.list(.) %>%
    Filter(is.numeric, .) %>%
    do.call("cbind", .) %>%
    { lm(t(.) ~ mean. - 1, data.frame(mean. = colnames(.))) } %>%
    coef %>%
    t
  )
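
The reason the lm approach works is that a least-squares fit on a single categorical predictor with the intercept removed has the group means as its coefficients. The sketch below checks this on hypothetical vectors `y` and `g`. The coefficient names are the predictor name pasted onto the level, which is why naming the predictor `mean.` in the solution yields names such as `mean.201510`.

```r
# Hypothetical response values and their group labels
y <- c(1, 2, 3, 10, 20)
g <- c("a", "a", "a", "b", "b")

# No-intercept fit: one coefficient per level of g
fit <- lm(y ~ g - 1)
print(coef(fit))

# The same numbers computed directly as group means
print(tapply(y, g, mean))
```

In the solution the response is the matrix t(X), so lm fits one such model per original row in a single call.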

Answer 2: (score: 0)

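Your columns do not have unique names, which suggests you may not be familiar with how tidy data works: column names are not the place to store relevant information. Please read https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Tidy data matters here because you are using functions written by its creator.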