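I have been trying on my own, and searching the web and Stack Overflow, for a while without success. I have a data frame that I subset by applying a condition and selecting a projection, but I cannot retrieve aggregated output from it.
The data frame mydf: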
mydf = list()
mydf = cbind(mydf,
c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
mydf = as.data.frame(mydf)
colnames(mydf) = c("city","salary","name")
Let's assume the relevant part of the data frame is returned by:
subset(mydf, city == "New York", select = c(salary, name))
which returns a data frame such as:
salary name
9 4000 Bartosz
10 7600 Damian
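Now I need to compute the sum and the avg of the given salaries, and select the employee with the lowest salary from the data frame above, preferably in a single line by modifying the code above (I'm guessing it's possible), so that it returns:
for sum: 11600
for avg: 5800
for least: 4000 Bartosz
I have tried (1)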
subset(mydf, city == "New York", select = sum(salary))
or (2)
x = subset(mydf, city == "New York", select = salary)
min(x)
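and many more combinations, which resulted only in errors saying that the summary function is only defined on a data frame with all numeric variables (2), or in the same output as the first code, without the sum (1).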
Answer 0 (score: 2)
Your mydf is odd, so I made my own. I split mydf by city, then run the necessary operations (mean, sum, etc.) on each subgroup to get the required data.
#DATA
mydf = structure(list(city = structure(c(1L, 1L, 2L), .Label = c("New York",
"San Francisco"), class = "factor"), salary = c(4000, 7600, 2500
), name = structure(1:3, .Label = c("Bartosz", "Damian", "Maciej"
), class = "factor")), .Names = c("city", "salary", "name"), row.names = c(NA,
-3L), class = "data.frame")
do.call(rbind, lapply(split(mydf, mydf$city), function(a)
data.frame(employee = a$name[which.min(a$salary)], #employee with least salary
mean = mean(a$salary), #mean salary
sum = sum(a$salary)))) #sum of salary
# employee mean sum
#New York Bartosz 5800 11600
#San Francisco Maciej 2500 2500
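If only one city is needed, as in the question, the same idea can be sketched without split(). This is a minimal illustration rather than the answerer's method: it assumes mydf is built with plain vector columns, and the names ny and ny_summary are made up for the example.

```r
# Vector-column version of the data (same values as in the question)
mydf <- data.frame(
  city   = c("New York", "New York", "San Francisco"),
  salary = c(4000, 7600, 2500),
  name   = c("Bartosz", "Damian", "Maciej"),
  stringsAsFactors = FALSE
)

# Subset once, then compute all three aggregates on the result
ny <- subset(mydf, city == "New York", select = c(salary, name))
ny_summary <- list(
  sum   = sum(ny$salary),
  avg   = mean(ny$salary),
  least = ny[which.min(ny$salary), ]  # whole row of the lowest-paid employee
)
print(ny_summary)
```

Note that which.min() returns only the first minimum, so ties for the lowest salary would need a comparison against min(ny$salary) instead.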
Answer 1 (score: 2)
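The problem is probably that your data frame object actually contains a bunch of lists. So, if you take
ny.df = subset(mydf, city == "New York", select = c(salary, name))
then any subsequent work needs to go through as.numeric calls to convert your lists to vectors. These will give you your answers:
sum(as.numeric(ny.df$salary)) # sum
mean(as.numeric(ny.df$salary)) # avg
ny.df[which(as.numeric(ny.df$salary) == min(as.numeric(ny.df$salary))),] # row with min salary
Alternatively, you can define mydf as a data frame of vectors rather than a data frame of lists.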
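To see the list-column issue directly, the question's construction can be inspected and then flattened. This is a rough sketch, not part of the original answer: the names col_classes and fixed are illustrative, and unlist() is just one possible way to turn each list column into an atomic vector.

```r
# Rebuild the question's data frame: cbind() onto an empty list yields list columns
mydf <- list()
mydf <- cbind(mydf,
              c("New York", "New York", "San Francisco"),
              c(4000, 7600, 2500),
              c("Bartosz", "Damian", "Maciej"))
mydf <- as.data.frame(mydf)
colnames(mydf) <- c("city", "salary", "name")

# Inspect the column classes; list columns are what break sum() and min()
col_classes <- sapply(mydf, class)
print(col_classes)

# Flatten every column to an atomic vector
fixed <- as.data.frame(lapply(mydf, unlist), stringsAsFactors = FALSE)
print(sapply(fixed, class))

# Aggregates now work without as.numeric()
print(sum(fixed$salary[fixed$city == "New York"]))
```

Building the data frame with data.frame() from the start avoids the conversion step entirely.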
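Answer 2 (score: 1)
Your data frame was structured oddly, as lists within the data frame, which is probably what was causing your problems. Here is a dplyr solution (now edited to find the minimum salary):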
library(dplyr)
mydf <- data.frame(
city = c("New York", "New York", "San Francisco"),
salary = c(4000, 7600, 2500),
name = c("Bartosz", "Damian", "Maciej"))
mydf %>%
group_by(city) %>%
mutate(avg = mean(salary),
sum = sum(salary)) %>%
top_n(-1, wt = salary)
# city salary name avg sum
# <fctr> <dbl> <fctr> <dbl> <dbl>
# 1 New York 4000 Bartosz 5800 11600
# 2 San Francisco 2500 Maciej 2500 2500
Answer 3 (score: 1)
I think dplyr is what you may be looking for:
library(dplyr)
mydf %>%
group_by(city) %>%
filter (city =="New York") %>%
summarise(mean(salary), sum(salary))
# A tibble: 1 x 3
# city mean(salary) sum(salary)
# <fctr> <dbl> <dbl>
#1 New York 5800 11600
There is a nice tutorial.
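The pipeline above returns the mean and sum but not the lowest-paid employee. One possible extension, sketched under the assumption that mydf has plain vector columns, adds that in the same summarise() call; the column names avg, total, least_paid, and least_salary are illustrative choices, not from the answer.

```r
library(dplyr)

mydf <- data.frame(
  city   = c("New York", "New York", "San Francisco"),
  salary = c(4000, 7600, 2500),
  name   = c("Bartosz", "Damian", "Maciej"),
  stringsAsFactors = FALSE
)

# Filter to one city, then collapse it to a single summary row
ny_stats <- mydf %>%
  filter(city == "New York") %>%
  summarise(avg          = mean(salary),
            total        = sum(salary),
            least_paid   = name[which.min(salary)],  # name at the minimum salary
            least_salary = min(salary))
print(ny_stats)
```

Dropping the filter() step and adding group_by(city) before summarise() would give one such row per city.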
Answer 4 (score: 1)
Using data.table:
library(data.table)
setDT(mydf)[, .( salary_sum = sum(salary),
salary_avg = mean(salary),
name = name[which.min(salary)]), by= city]
> city salary_sum salary_avg name
> 1: New York 11600 5800 Bartosz
> 2: San Francisco 2500 2500 Maciej
Your dataset:
mydf = data.frame(city=c("New York", "New York", "San Francisco"),
salary=c(4000, 7600, 2500),
name=c("Bartosz", "Damian", "Maciej"))