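I need to cross-tabulate a multiple-response question (stored as a set of variables) by a grouping variable. My survey question is: "Which of the following fruits do you have?" Respondents from geographic Area 1 or Area 2 are then given a list with "1. Orange, 2. Mango, ...", and the resulting data from the yes (1) or no (0) answers are: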
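set.seed(1)
# Note: R >= 3.6.0 changed the default sampling algorithm, so this seed
# may not reproduce the exact rows printed below.
df <- data.frame(area = rep(c('Area 1', 'Area 2'), each = 6),
                 var_orange = sample(0:1, 12, TRUE),
                 var_banana = sample(0:1, 12, TRUE),
                 var_melon = sample(0:1, 12, TRUE),
                 var_mango = sample(0:1, 12, TRUE))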
area var_orange var_banana var_melon var_mango
1 Area 1 0 1 0 1
2 Area 1 0 0 0 0
3 Area 1 1 1 0 1
4 Area 1 1 0 0 0
5 Area 1 0 1 1 1
6 Area 1 1 1 0 1
7 Area 2 1 0 0 1
8 Area 2 1 1 1 1
9 Area 2 1 1 0 1
10 Area 2 0 0 0 1
11 Area 2 0 1 1 0
12 Area 2 0 0 1 0
I would like to get summary output like this one from Stata:
            |          area          |
            |   Area 1      Area 2   |     Total
------------+------------------------+-----------
 var_orange |    50.00       50.00   |     50.00
 var_banana |    66.67       50.00   |     58.33
  var_melon |    16.67       50.00   |     33.33
  var_mango |    66.67       66.67   |     66.67
------------+------------------------+-----------
      Total |   200.00      216.67   |    208.33
I found a related post with a multfreqtable function, which gives a one-way summary of my data:
multfreqtable = function(data, question.prefix) {
  z = length(question.prefix)
  temp = vector("list", z)
  for (i in 1:z) {
    a = grep(question.prefix[i], names(data))
    b = sum(data[, a] != 0)
    d = colSums(data[, a])
    e = sum(rowSums(data[, a]) != 0)
    f = as.numeric(c(d, b))
    temp[[i]] = data.frame(question = c(sub(question.prefix[i],
                                            "", names(d)), "Total"),
                           freq = f,
                           percent_response = (f/b)*100,
                           percent_cases = round((f/e)*100, 2))
    names(temp)[i] = question.prefix[i]
  }
  temp
}
multfreqtable(df, "var_")
$var_
question freq percent_response percent_cases
1 orange 6 24 54.55
2 banana 7 28 63.64
3 melon 4 16 36.36
4 mango 8 32 72.73
5 Total 25 100 227.27
But I am interested in a two-way summary.
I can use dplyr, as suggested in the post, and get:
library(dplyr)

df %>%
  summarise(orange_pct = round(sum(var_orange, na.rm = TRUE) * 100 / n(), 2),
            banana_pct = round(sum(var_banana, na.rm = TRUE) * 100 / n(), 2),
            melon_pct = round(sum(var_melon, na.rm = TRUE) * 100 / n(), 2),
            mango_pct = round(sum(var_mango, na.rm = TRUE) * 100 / n(), 2))
orange_pct banana_pct melon_pct mango_pct
1 50 58.33 33.33 66.67
But I need a tidier table output with marginal column frequencies.
Answer 0 (score: 0)
Here is a solution using aggregate. Thanks to @rawr for the suggestion to simplify it with addmargins:
T1 = aggregate(df[,2:5], list(df$area), sum)
rownames(T1) = T1[,1]
T1 = t(T1[,-1])
T1 = addmargins(T1, 1:2, FUN = c(Total = sum), quiet=TRUE)
T1
Area 1 Area 2 Total
var_orange 3 3 6
var_banana 4 3 7
var_melon 1 3 4
var_mango 4 4 8
Total 12 13 25
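If you want the table in percentages rather than counts, divide each count by the relevant total (for the table in the question, the number of respondents in each area) to get a fraction, then multiply by 100.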
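A minimal sketch of that conversion in base R, assuming the denominator should be the number of respondents per area (which is what the Stata table in the question appears to use). The data frame is typed in from the rows printed in the question so that the result does not depend on the random number generator; the names counts, n and pct are only illustrative.

```r
# Data copied from the rows printed in the question
df <- data.frame(
  area       = rep(c("Area 1", "Area 2"), each = 6),
  var_orange = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0),
  var_banana = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0),
  var_melon  = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  var_mango  = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0)
)

# Counts of "yes" answers: fruits in rows, areas in columns
counts <- aggregate(df[, 2:5], list(area = df$area), sum)
rownames(counts) <- counts$area
counts <- t(counts[, -1])
counts <- cbind(counts, Total = rowSums(counts))

# Respondents per area (same alphabetical order as aggregate) and overall
n <- c(as.vector(table(df$area)), nrow(df))

# Divide each column by its number of respondents, then add the column totals
pct <- sweep(counts, 2, n, "/") * 100
pct <- rbind(pct, Total = colSums(pct))
round(pct, 2)
```

With the data above, the rounded matrix should match the percentages in the table requested in the question. If a different base is wanted (for example, the percentage of all responses rather than of respondents), only the vector n needs to change.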
Answer 1 (score: 0)
You can first compute the values with dplyr and then put them into a table using, for example, knitr::kable.
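library(dplyr)
library(knitr)
set.seed(1)
df <- data.frame(area = rep(c('Area 1','Area 2'), each = 6),
                 var_orange = sample(0:1, 12, TRUE),
                 var_banana = sample(0:1, 12, TRUE),
                 var_melon = sample(0:1, 12, TRUE),
                 var_mango = sample(0:1, 12, TRUE))
# Note: summarise_each() and funs() are deprecated in later dplyr versions.
# Mean of each 0/1 variable within each area = share of "yes" answers
t1 <- df %>% group_by(area) %>% summarise_each(funs(mean))
# Overall means; area is not numeric, so its "mean" is NA (with a warning)
t2 <- df %>% summarise_each(funs(mean))
kable(rbind(t1, t2))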
You will get:
|area | var_orange| var_banana| var_melon| var_mango|
|:------|----------:|----------:|---------:|---------:|
|Area 1 | 0.5| 0.6666667| 0.1666667| 0.6666667|
|Area 2 | 0.5| 0.5000000| 0.5000000| 0.6666667|
|NA | 0.5| 0.5833333| 0.3333333| 0.6666667|
To refine the output further so that it mimics Stata:
polished <- 100 * rbind(t1, t2) %>% # Use percentages
select(-area) %>% # Drop "area"
mutate(Total = rowSums(.[])) %>% # Add Total
as.matrix %>% t
kable(polished, digits = 2, col.names = c("Area 1", "Area 2", "Total"))
The final result will be:
| | Area 1| Area 2| Total|
|:----------|------:|------:|------:|
|var_orange | 50.00| 50.00| 50.00|
|var_banana | 66.67| 50.00| 58.33|
|var_melon | 16.67| 50.00| 33.33|
|var_mango | 66.67| 66.67| 66.67|
|Total | 200.00| 216.67| 208.33|
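summarise_each() and funs() have since been deprecated in dplyr, so here is a rough sketch of the same computation using across(), assuming dplyr 1.0.0 or later. It is not the code from the answer, only one possible modern equivalent; the data frame is typed in from the rows printed in the question, and the names by_area, overall and m are illustrative.

```r
library(dplyr)

# Data copied from the rows printed in the question
df <- data.frame(
  area       = rep(c("Area 1", "Area 2"), each = 6),
  var_orange = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0),
  var_banana = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0),
  var_melon  = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  var_mango  = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0)
)

# Share of "yes" answers per area
by_area <- df %>%
  group_by(area) %>%
  summarise(across(starts_with("var_"), mean))

# Share of "yes" answers over all respondents, labelled "Total"
overall <- df %>%
  summarise(across(starts_with("var_"), mean)) %>%
  mutate(area = "Total")

# Stack, convert to percentages, and put the fruits in rows
stacked <- bind_rows(by_area, overall)
m <- 100 * t(as.matrix(stacked[, -1]))
colnames(m) <- stacked$area
m <- rbind(m, Total = colSums(m))
round(m, 2)
```

The resulting matrix can be passed to knitr::kable(m, digits = 2) to get the same kind of table as above. Labelling the overall row explicitly avoids the NA that appears in the area column of the intermediate table.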