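I'm new to R, so apologies for the basic question. I have a data frame, "data", with three columns (as an example): data$engine, data$unit and data$AvailableLeft. data$AvailableLeft is a dummy variable (0 or 1), and each unique value of data$engine can have several values of data$unit. I want to calculate the percentage of 1s in data$AvailableLeft separately for each combination of data$engine and data$unit. I have several hundred thousand rows, but I am pasting only the first 13.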
data$engine data$unit data$AvailableLeft
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0
I would like the output in this format:
data$engine data$unit Percentage
10158 207 20%
10147 142 10%
10161 244 3%
. . .
. . .
. . .
I tried the following code, without success:
##calculate the percentage of "1s" for whole data and not for each data$engine and data$unit
sum(data$AvailableLeft==1)/length(data$AvailableLeft)
# tried to do it in parts but was not able to divide the two columns at last...
df11 <- data.frame(data$engine, data$unit, data$AvailbleLeft)
leftwarn1=aggregate(data$AvailableLeft ~ data$engine + data$unit, data = df11, sum) #Counting number of "1s" per unit per engine
leftwarn10 = count(data$AvailableLeft == 0, c("data$engine","data$unit")) #counting number of "1 and 0" per unit per engine
Answer 0 (score: 0)
dta <- read.table(text = "
data$engine data$unit data$AvailableLeft
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0",
header = TRUE)
# dta[, 3], for example, returns the third column.
# aggregate(), as its help page (?aggregate) says, computes summary statistics of data subsets.
# The mean of a 0/1 column is the fraction of 1s in that subset.
aggregate(dta[, 3], by = list(dta[, 1], dta[, 2]), mean)
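The call above returns proportions between 0 and 1 with generic column names. A minimal sketch of the same idea that reports percentages is below. It assumes the columns are named plainly as engine, unit and AvailableLeft (not the data$ prefixed headers used above), and the Percentage and Label column names are made up for illustration.

```r
# Sample rows from the question, with plain column names (an assumption)
dta <- read.table(text = "
engine unit AvailableLeft
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0",
header = TRUE)

# Percentage of 1s per engine/unit pair: 100 * mean of the 0/1 column
res <- aggregate(AvailableLeft ~ engine + unit, data = dta,
                 FUN = function(x) 100 * mean(x))
names(res)[3] <- "Percentage"

# Optional display column with a percent sign and two decimals
res$Label <- sprintf("%.2f%%", res$Percentage)
print(res)
```

Only engine/unit pairs that actually occur in the data appear in the result, so no filtering of empty combinations is needed.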
Answer 1 (score: 0)
My solution is long, but it works well for me:
data<-read.table(paste0(file.path(Sys.getenv("USERPROFILE"),"Desktop"),
"/dta.txt"), header = TRUE) # I transcribed your example above into Notepad and
# saved it as dta.txt, so I could read the table in R
enginevalues<-unique(data$engine) # Unique values of "engine" column
unitvalues<-unique(data$unit) # Unique values of "unit" column
output<-matrix(ncol=3) # Matrix where I stored the outputs
digitsafterdot<-2 # Number of digits after the decimal point (or comma, whatever)
# Next, two nested for loops: one over the "engine" values and one over the "unit" values
# Together they visit every engine/unit combination, including ones that never occur in the data
for(eng in enginevalues){
dteng<-data[data[,"engine"]==eng,]
for(un in unitvalues){
dtunit<-dteng[dteng[,"unit"]==un,]
# Percentage: Number of 1's x 100 divided by the total number of AvailableLeft values
percentage<-round(sum(dtunit[,"AvailableLeft"] == 1)*100/nrow(dtunit),
digits=digitsafterdot)
# An empty subset gives 0/0 = NaN in R (not an error), so set the percentage to 0 in that case
if(nrow(dtunit) == 0) percentage<-0
output<-rbind(output,c(eng,un,percentage))
}
}
output<-output[-1,] # Just removing the initial NA values
colnames(output)<-c("engine","unit","percentage") # Renaming the output
output
# engine unit percentage
# [1,] 10158 207 50.00
# [2,] 10158 142 0.00
# [3,] 10158 244 0.00
# [4,] 10147 207 0.00
# [5,] 10147 142 71.43
# [6,] 10147 244 0.00
# [7,] 10161 207 0.00
# [8,] 10161 142 0.00
# [9,] 10161 244 0.00
# Output without zero values (note: this also drops real groups whose percentage is 0, such as 10161/244)
outputnozeros<-output[output[,"percentage"]!=0.00,]
outputnozeros
# engine unit percentage
# [1,] 10158 207 50.00
# [2,] 10147 142 71.43
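@NBATrends's solution also works fine and is very compact, but the solution given here offers some extra control through the loops. I think both solutions work like a charm.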
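If the loop style is preferred, a possible variant is to loop only over the engine/unit pairs that actually occur, which avoids the empty combinations and keeps groups whose percentage is genuinely 0. The sketch below assumes plain column names and rebuilds the sample data inline; the names combos and rows are illustrative only.

```r
# Sample rows from the question (plain column names are an assumption)
data <- data.frame(
  engine = rep(c(10158, 10147, 10161), times = c(4, 7, 2)),
  unit = rep(c(207, 142, 244), times = c(4, 7, 2)),
  AvailableLeft = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0)
)

# One row per engine/unit pair that is present in the data
combos <- unique(data[, c("engine", "unit")])
combos$percentage <- NA_real_

for (i in seq_len(nrow(combos))) {
  # Logical index of the rows belonging to the current pair
  rows <- data$engine == combos$engine[i] & data$unit == combos$unit[i]
  # Share of 1s in this group, as a percentage rounded to 2 digits
  combos$percentage[i] <- round(100 * mean(data$AvailableLeft[rows] == 1), 2)
}

print(combos)
```

Each iteration scans the whole data frame, so with several hundred thousand rows and many groups a grouped function such as aggregate is likely to be faster.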
Answer 2 (score: 0)
Using everyone's suggestions, I wrote the script this way, and it seems to work (I'm not sure):
df11 <- data.frame(data$engine, data$unit, data$AvailableLeft)
warn = aggregate(data$AvailableLeft ~ data$engine + data$unit, data = df11, mean)
Any comments?
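One way to check the result is to compute the count of 1s and the group size separately and divide them, which is the step the original attempt in the question got stuck on. The sketch below does that on the sample rows and compares the outcome with the mean-based version. It assumes plain column names; ones, total and check are hypothetical names introduced for illustration.

```r
# Sample rows from the question (plain column names are an assumption)
data <- data.frame(
  engine = rep(c(10158, 10147, 10161), times = c(4, 7, 2)),
  unit = rep(c(207, 142, 244), times = c(4, 7, 2)),
  AvailableLeft = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0)
)

# Number of 1s and number of rows per engine/unit pair
ones <- aggregate(AvailableLeft ~ engine + unit, data = data, FUN = sum)
total <- aggregate(AvailableLeft ~ engine + unit, data = data, FUN = length)

# Join the two summaries on the grouping columns and divide
check <- merge(ones, total, by = c("engine", "unit"),
               suffixes = c(".ones", ".total"))
check$Percentage <- 100 * check$AvailableLeft.ones / check$AvailableLeft.total

# Mean-based version, written with column names inside the formula
warn <- aggregate(AvailableLeft ~ engine + unit, data = data, FUN = mean)
warn$Percentage <- 100 * warn$AvailableLeft

print(check)
print(warn)
```

Because AvailableLeft only takes the values 0 and 1, the group mean equals the count of 1s divided by the group size, so both tables should agree.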
Answer 3 (score: 0)
If you have a large data frame, try the data.table library. Using the data created by NBATrends:
library(data.table)
dta <- read.table(text = "
data$engine data$unit data$AvailableLeft
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0",
header = TRUE)
dt <- as.data.table(dta)
dt[,sum(data.AvailableLeft)*100/.N,.(data.engine,data.unit)]
data.engine data.unit V1
1: 10158 207 50.00000
2: 10147 142 71.42857
3: 10161 244 0.00000
This should be closer to what you asked for:
dt[,paste(as.character(round(sum(data.AvailableLeft)*100/.N,2)),"%"),.(data.engine,data.unit)]
which gives
data.engine data.unit V1
1: 10158 207 50 %
2: 10147 142 71.43 %
3: 10161 244 0 %
Working out how to get the percentage of 0s in data$AvailableLeft should be straightforward, so I leave that to the reader.
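For reference, one possible way to get the share of 0s next to the share of 1s in a single grouped call is sketched below. It assumes the data.table package is installed and that the columns have plain names; PctOnes, PctZeros and Label are illustrative names, not part of the original answer.

```r
library(data.table)

# Sample rows from the question (plain column names are an assumption)
dt <- data.table(
  engine = rep(c(10158, 10147, 10161), times = c(4, 7, 2)),
  unit = rep(c(207, 142, 244), times = c(4, 7, 2)),
  AvailableLeft = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0)
)

# Percentage of 1s and of 0s per engine/unit pair
res <- dt[, .(PctOnes = 100 * mean(AvailableLeft == 1),
              PctZeros = 100 * mean(AvailableLeft == 0)),
          by = .(engine, unit)]

# Display column without the space before the percent sign
res[, Label := sprintf("%.2f%%", PctOnes)]
print(res)
```

Keeping the numeric columns and adding the formatted label separately means the percentages can still be sorted or filtered as numbers.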
Answer 4 (score: -1)
Try
subset(as.data.frame(with(df, prop.table(table(engine, unit, AvailableLeft))*100)), AvailableLeft==1, select=-AvailableLeft)
Regarding your comment:
df <- read.table(col.names=c("engine", "unit", "left"), text="
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0")
subset(as.data.frame(with(df, prop.table(table(engine, unit, left))*100)), left==1, select=-left)
# engine unit Freq
# 10 10147 142 38.46154
# 11 10158 142 0.00000
# 12 10161 142 0.00000
# 13 10147 207 0.00000
# 14 10158 207 15.38462
# 15 10161 207 0.00000
# 16 10147 244 0.00000
# 17 10158 244 0.00000
# 18 10161 244 0.00000
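Note that prop.table without a margin divides every cell by the grand total, so the Freq values above are shares of all 13 rows (5/13 = 38.46 and 2/13 = 15.38), not percentages within each engine/unit pair. A sketch of a per-group variant, assuming the same df as above, passes margin = c(1, 2) so that each engine/unit cell is normalised on its own. Pairs that never occur give 0/0 = NaN and are filtered out here.

```r
# Same sample data as in the answer above
df <- read.table(col.names = c("engine", "unit", "left"), text = "
10158 207 1
10158 207 0
10158 207 1
10158 207 0
10147 142 1
10147 142 1
10147 142 1
10147 142 0
10147 142 1
10147 142 0
10147 142 1
10161 244 0
10161 244 0")

# 3-way contingency table: engine x unit x left
tab <- with(df, table(engine, unit, left))

# Normalise within each engine/unit pair (margins 1 and 2), as percentages
pct <- as.data.frame(prop.table(tab, margin = c(1, 2)) * 100)

# Keep the share of 1s; drop engine/unit pairs with no rows (NaN)
res <- subset(pct, left == 1 & !is.nan(Freq), select = -left)
print(res)
```

On large data with many distinct engines and units the full contingency table can get big, because it allocates a cell for every possible combination.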