Calculating output in a data frame using aggregate functions

Date: 2017-04-24 22:16:35

Tags: r dataframe aggregate

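I have been trying this on my own and searching online, including Stack Overflow, for a while without success. I have a data frame that I subset by applying a condition and selecting columns (a projection), but I cannot retrieve aggregate output from it.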

The data frame mydf:

mydf = list()
mydf = cbind(mydf, 
            c("New York", "New York", "San Francisco"),
            c(4000, 7600, 2500),
            c("Bartosz", "Damian", "Maciej"))
mydf = as.data.frame(mydf)
colnames(mydf) = c("city","salary","name")

Let's assume the following subset of the data frame:

subset(mydf, city == "New York", select = c(salary, name))

returns a data frame such as:

   salary    name
9    4000 Bartosz
10   7600  Damian

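Now I need to calculate the sum and avg of the given salaries, and select the employee with the lowest salary from the data frame above, preferably in a single line by modifying the code above (I guess it's possible), so that it returns:

for sum: 11600

for avg: 5800

for min: 4000 Bartosz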

I have tried (1)

subset(mydf, city == "New York", select = sum(salary))

or (2)

x = subset(mydf, city == "New York", select = salary)
min(x)

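and many more combinations. They produce either an error saying that the summary function is only defined on a data frame with all numeric variables (2), or the same output as the first code, without the sum (1).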

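5 Answers:

Answer 0: (score: 2)

Your mydf is strange, so I made my own. I split mydf by city, then ran the necessary operations (mean, sum, etc.) on each subgroup to get the required data.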

#DATA
mydf = structure(list(city = structure(c(1L, 1L, 2L), .Label = c("New York", 
"San Francisco"), class = "factor"), salary = c(4000, 7600, 2500
), name = structure(1:3, .Label = c("Bartosz", "Damian", "Maciej"
), class = "factor")), .Names = c("city", "salary", "name"), row.names = c(NA, 
-3L), class = "data.frame")

do.call(rbind, lapply(split(mydf, mydf$city), function(a)
    data.frame(employee = a$name[which.min(a$salary)], #employee with least salary
               mean = mean(a$salary), #mean salary
               sum = sum(a$salary)))) #sum of salary
#              employee mean   sum
#New York       Bartosz 5800 11600
#San Francisco   Maciej 2500  2500
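
For comparison, the same per-city summary can also be sketched with base R's aggregate() and merge(). This is only one possible arrangement, not part of the answer above. The column names sum, avg, min_salary and min_name are chosen here for illustration, and the data frame is assumed to hold plain vectors rather than lists.

```r
# Sample data with plain vector columns (assumed; not the list-based mydf from the question).
mydf <- data.frame(city = c("New York", "New York", "San Francisco"),
                   salary = c(4000, 7600, 2500),
                   name = c("Bartosz", "Damian", "Maciej"),
                   stringsAsFactors = FALSE)

# Sum and mean of salary per city via the formula interface of aggregate().
totals <- aggregate(salary ~ city, data = mydf, FUN = sum)
names(totals)[2] <- "sum"
means <- aggregate(salary ~ city, data = mydf, FUN = mean)
names(means)[2] <- "avg"

# Lowest-paid employee per city: sort by salary, keep the first row seen for each city.
ordered <- mydf[order(mydf$salary), ]
lowest <- ordered[!duplicated(ordered$city), c("city", "salary", "name")]
names(lowest)[2:3] <- c("min_salary", "min_name")

# Combine the three pieces on the city column.
result <- Reduce(function(x, y) merge(x, y, by = "city"),
                 list(totals, means, lowest))

# Keep only the row the question asks about.
result[result$city == "New York", ]
```

Splitting the work into three small tables keeps each aggregate() call to a single statistic, which avoids the matrix-valued column that aggregate() produces when FUN returns more than one value.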

Answer 1: (score: 2)

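The problem is probably that your data frame object actually contains a bunch of lists. So, if you take

ny.df = subset(mydf, city == "New York", select = c(salary, name))

then any subsequent work needs to go through an as.numeric call to convert your lists to vectors. These will give you your answers:

sum(as.numeric(ny.df$salary)) # sum
mean(as.numeric(ny.df$salary)) # avg
ny.df[which(as.numeric(ny.df$salary) == min(as.numeric(ny.df$salary))),] # row with min salary

Alternatively, you can define mydf as a data frame of vectors rather than a data frame of lists, for example by building it with data.frame() directly.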
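
One way to confirm the list-column diagnosis and repair an existing data frame is sketched below. It rebuilds mydf exactly as posted in the question, checks each column with is.list(), and flattens the columns with unlist(). The name fixed is introduced here for illustration, and the sketch assumes every cell holds exactly one value.

```r
# Rebuild the data frame exactly as posted in the question.
mydf <- list()
mydf <- cbind(mydf,
              c("New York", "New York", "San Francisco"),
              c(4000, 7600, 2500),
              c("Bartosz", "Damian", "Maciej"))
mydf <- as.data.frame(mydf)
colnames(mydf) <- c("city", "salary", "name")

# Check whether each column is a list rather than an atomic vector.
sapply(mydf, is.list)

# Flatten every list column into an atomic vector (assumes one value per cell).
fixed <- as.data.frame(lapply(mydf, unlist), stringsAsFactors = FALSE)

# With atomic columns, the usual summary functions work on the subset.
ny <- subset(fixed, city == "New York")
sum(ny$salary)   # 11600
mean(ny$salary)  # 5800
ny[which.min(ny$salary), c("salary", "name")]  # row with the lowest salary
```

Converting once up front means later code does not need an as.numeric call at every use.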

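Answer 2: (score: 1)

Your data frame has an unusual structure, with lists as its columns, which is probably the cause of your problem. Here is a dplyr solution (now edited to find the lowest salary):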

library(dplyr)
mydf <- data.frame(
             city = c("New York", "New York", "San Francisco"),
             salary = c(4000, 7600, 2500),
             name = c("Bartosz", "Damian", "Maciej"))

mydf %>% 
  group_by(city) %>%
  mutate(avg = mean(salary),
         sum = sum(salary)) %>%
  top_n(-1, wt = salary) 

#            city salary    name   avg   sum
#          <fctr>  <dbl>  <fctr> <dbl> <dbl>
# 1      New York   4000 Bartosz  5800 11600
# 2 San Francisco   2500  Maciej  2500  2500

Answer 3: (score: 1)

I think dplyr is what you might be looking for:

   library(dplyr)
   mydf %>% 
   group_by(city) %>% 
   filter (city =="New York") %>%
   summarise(mean(salary), sum(salary))

  # A tibble: 1 x 3
  #  city mean(salary) sum(salary)
  #  <fctr>        <dbl>       <dbl>
  #1 New York         5800       11600
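
The summary above leaves out the lowest-paid employee the question also asks for. A possible extension, assuming mydf has plain vector columns, is sketched below. The output column names total, avg, min_salary and min_name are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Sample data with plain vector columns (assumed).
mydf <- data.frame(city = c("New York", "New York", "San Francisco"),
                   salary = c(4000, 7600, 2500),
                   name = c("Bartosz", "Damian", "Maciej"),
                   stringsAsFactors = FALSE)

# One-row summary for New York: sum, mean, lowest salary and who earns it.
ny_summary <- mydf %>%
  filter(city == "New York") %>%
  summarise(total = sum(salary),
            avg = mean(salary),
            min_salary = min(salary),
            min_name = name[which.min(salary)])

ny_summary
```

Here which.min() returns the position of the smallest salary within the filtered rows, so indexing name with it picks out the matching employee.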

There is a nice tutorial at this link: [https://rpubs.com/justmarkham/dplyr-tutorial]

Answer 4: (score: 1)

There is a simple and fast solution using data.table:

library(data.table) 

setDT(mydf)[, .( salary_sum = sum(salary),
                 salary_avg = mean(salary),
                 name = name[which.min(salary)]), by= city]

>             city salary_sum salary_avg    name
> 1:      New York      11600       5800 Bartosz
> 2: San Francisco       2500       2500  Maciej

Your dataset:

mydf = data.frame(city=c("New York", "New York", "San Francisco"),
                  salary=c(4000, 7600, 2500),
                  name=c("Bartosz", "Damian", "Maciej"))