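I know the answer to this is probably simple, but I have searched various forums extensively and have not been able to find a solution.
I have a column called Data_source, which is the factor I want to group my variables by.
I have a series of symptom* variables that I want to count by Data_source.
For some reason I can't figure out how to do it. The usual group_by functions don't seem to work properly.
Here is the data frame in question: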
df <- wrapr::build_frame(
"Data_source" , "Sex" , "symptoms_decLOC", "symptoms_nausea_vomitting" |
"1" , "Female", NA_character_ , NA_character_ |
"1" , "Female", NA_character_ , NA_character_ |
"1" , "Female", "No" , NA_character_ |
"1" , "Female", "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"1" , "Male" , "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", "Yes" , "No" |
"2" , "Female", "Yes" , "No" |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ )
Note that the sex and symptom variables are all factors, and they include NAs. I tried the following:
df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
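This does not work, and it is not ideal anyway because I would have to repeat it for every column. Ideally I would use something like lapply(df, count), but that does not give me a breakdown per group.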
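Edit
In response to the questions below, I have added the expected output. I edited it in Excel and color-coded it by group_by for clarity.
Note how I get a breakdown for each possible answer. When I run this with dplyr, the output is as follows.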
> df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
# A tibble: 2 x 3
# Groups: Data_source [2]
Data_source `"symptoms_decLOC"` n
<chr> <chr> <int>
1 1 symptoms_decLOC 5
2 2 symptoms_decLOC 2
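The output above is consistent with how count() treats its arguments: a quoted string is a constant expression, so every row gets the same value "symptoms_decLOC" and n is just the number of rows per group that survived na.omit(). Below is a minimal sketch of the difference, using a small made-up data frame (toy is hypothetical and is not the data from the question):

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  Data_source = c("1", "1", "1", "2", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes", "Yes", NA),
  stringsAsFactors = FALSE
)

# Quoted: the string is a constant, so this only counts rows per Data_source
quoted <- toy %>% count(Data_source, "symptoms_decLOC")

# Unquoted: the values of the column are counted within each Data_source
unquoted <- toy %>%
  filter(!is.na(symptoms_decLOC)) %>%
  count(Data_source, symptoms_decLOC)
```

Here quoted has one row per Data_source, while unquoted has one row per observed Data_source and answer combination.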
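Answer 0 (score: 1)
This is the most workable approach I have found so far; I have not yet figured out how to include zero-count groups. Adding .drop=FALSE is supposed to take care of this, but it does not work for me (using dplyr v. 0.8.0.9001).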
library(dplyr)
library(tidyr)
(df
%>% tidyr::gather(var,val,-Data_source)
%>% count(Data_source,var,val, .drop=FALSE)
%>% na.omit()
)
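On the zero-count problem: as far as I know, .drop = FALSE in count() only preserves empty factor levels, and after reshaping the value column is character, so there are no unused levels to keep. One possible workaround, which is my assumption and not part of the answer above, is to add the missing combinations with tidyr::complete(). A sketch on hypothetical toy data, restricted to the symptom columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  Data_source = c("1", "1", "1", "2", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes", "Yes", NA),
  symptoms_nausea_vomitting = c("No", "No", NA, "No", NA),
  stringsAsFactors = FALSE
)

counts <- toy %>%
  # wide to long: one row per Data_source / variable / answer
  pivot_longer(-Data_source, names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%
  count(Data_source, var, val) %>%
  # add every Data_source x var x {Yes, No} combination; absent ones get n = 0
  complete(Data_source, var, val = c("Yes", "No"), fill = list(n = 0L))
```

The answer set c("Yes", "No") is supplied by hand here; a column such as Sex with different answers would need its own call.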
Answer 1 (score: 1)
This uses @Ben Bolker's answer to get the counts per group, then spread and gather to include the zero-count groups.
dplyr
library(dplyr)
library(tidyr)
# Count number of occurrences by Data_source
df2 <-
df %>%
gather(variable, value, -Data_source) %>%
count(Data_source, variable, value, name = "counter") %>%
na.omit()
# For variable == "Sex", leave as is
# For everything else (here the symptoms_* columns), convert value to a factor to include zero-count groups
# Then spread to wide with missing combinations filled with 0, and gather back to long to bind rows
bind_rows(df2 %>%
filter(variable == "Sex"),
df2 %>%
filter(variable != "Sex") %>%
mutate(value = factor(value, levels = c("Yes", "No"))) %>%
spread(key = value, value = counter, fill = 0) %>%
gather(value, counter, -Data_source, -variable)) %>%
arrange(Data_source, variable)
data.table
library(data.table)
dt <- data.table(df)
# Melt data by Data source
dt_melt <- melt(dt, id.vars = "Data_source", value.factor = FALSE, variable.factor = FALSE)
# Add counter, if NA then 0 else 1
dt_melt[, counter := 0]
dt_melt[!is.na(value), counter := 1]
# Sum number of occurrences
dt_count <- dt_melt[,list(counter = sum(counter)), by = c("Data_source", "variable", "value")]
# Split into two dt
dt2a <- dt_count[variable == "Sex", ]
# Keep only non-missing answers (the dplyr version does the same with na.omit())
dt2b <- dt_count[variable != "Sex" & !is.na(value), ]
# only on symptoms variables
# Convert into factor variable
dt2b$value <- factor(dt2b$value, levels = c("Yes", "No"))
dt2b_dcast <- dcast(data = dt2b, formula = Data_source + variable ~ value, value.var = "counter", fill = 0, drop = FALSE)
dt2b_melt <- melt(dt2b_dcast, id.vars = c("Data_source", "variable"), variable.name = "value", value.name = "counter")
# combine
combined_d <- rbind(dt2a, dt2b_melt)
combined_d[order(Data_source, variable), ]
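As a possible alternative to the dcast/melt round trip, the zero-count rows can also be added by joining the counts onto a full grid built with CJ(). This is only a sketch under my own assumptions (hypothetical toy data, symptom columns only, answers fixed to "Yes" and "No"), not the method used above:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  Data_source = c("1", "1", "1", "2", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes", "Yes", NA),
  symptoms_nausea_vomitting = c("No", "No", NA, "No", NA)
)

# Wide to long, dropping missing answers
long <- melt(toy, id.vars = "Data_source", variable.factor = FALSE, na.rm = TRUE)

# Observed counts per Data_source / variable / answer
counts <- long[, .N, by = .(Data_source, variable, value)]

# Every combination we want to see in the result
grid <- CJ(
  Data_source = unique(toy$Data_source),
  variable = setdiff(names(toy), "Data_source"),
  value = c("Yes", "No")
)

# Join the counts onto the grid; combinations without a match get N = 0
out <- counts[grid, on = .(Data_source, variable, value)]
out[is.na(N), N := 0L]
```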
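Answer 2 (score: 0)
I don't fully understand what you are asking, but I assume you want to count the number of non-NA values in each symptom_* column.
Here is a data.table solution: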
# load library
library(data.table)
# Suppose the table is called "dt". Convert it to a data.table:
setDT(dt)
# convert the wide table to a long one, filter the values that
# aren't NA and count both, by Data_source and by variable
# (variable is the created column with the symptom_* names)
melt(dt, id.vars = 1:2)[!is.na(value),
.N,
by = .(Data_source, variable)]
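What each part of the code does:
melt(dt, id.vars = 1:2) converts dt from wide to long, keeping columns 1 and 2 (Data_source and Sex) fixed.
!is.na(value) filters out the NA values (the values that were previously under each symptom_* header), keeping only the non-missing ones.
.N counts the rows.
by = .(Data_source, variable) is the grouping used for the count. variable is the name of the column where the symptom_* column names end up during reshaping.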
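To make the steps concrete, here they are applied one at a time to a small table. The data below is hypothetical and only chosen so that the intermediate results are easy to check by eye:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  Data_source = c("1", "1", "2", "2"),
  Sex = c("Female", "Male", "Female", "Male"),
  symptoms_decLOC = c("Yes", NA, "Yes", NA),
  symptoms_nausea_vomitting = c("No", NA, NA, NA)
)

# Step 1: wide to long, 4 rows x 2 symptom columns = 8 rows
long <- melt(toy, id.vars = 1:2)

# Step 2: keep only the rows whose answer is not NA
non_na <- long[!is.na(value)]

# Step 3: count rows per Data_source and symptom column
result <- non_na[, .N, by = .(Data_source, variable)]
```

Note that a Data_source and variable pair with no non-NA answers simply does not appear in result.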
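Answer 3 (score: 0)
Indeed, the hard part is keeping the combinations that do not occur in the data. Here is a solution in two steps:
1. Prepare the structure without counts
You can do this however you like; I chose to build two chunks because the variable Sex takes different values from the symptom variables. There is no need to bind the chunks at this stage.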
chunk1 <- expand.grid(
Data_source = c("1", "2"),
name = c("symptoms_decLOC", "symptoms_nausea_vomitting"),
value = c("Yes", "No"),
stringsAsFactors = FALSE
)
chunk2 <- expand.grid(
Data_source = c("1", "2"),
name = "Sex",
value = c("Female", "Male"),
stringsAsFactors = FALSE
)
2. Do the requested job
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = c("Sex", "symptoms_decLOC", "symptoms_nausea_vomitting"))%>%
group_by(Data_source, name, value) %>%
summarise(count = n()) %>%
right_join(bind_rows(chunk1, chunk2), by = c("Data_source", "name", "value")) %>%
arrange(Data_source, name) %>%
mutate(count = zoo::na.fill(count, 0))
Et voilà:
# A tibble: 12 x 4
# Groups: Data_source, name [6]
Data_source name value count
<chr> <chr> <chr> <int>
1 1 Sex Female 7
2 1 Sex Male 1
3 1 symptoms_decLOC Yes 5
4 1 symptoms_decLOC No 1
5 1 symptoms_nausea_vomitting Yes 0
6 1 symptoms_nausea_vomitting No 5
7 2 Sex Female 6
8 2 Sex Male 6
9 2 symptoms_decLOC Yes 2
10 2 symptoms_decLOC No 0
11 2 symptoms_nausea_vomitting Yes 0
12 2 symptoms_nausea_vomitting No 2
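It is not that short, but it uses simple functions. The process is similar to what you could do in Excel: prepare the structure first, then fill in the counts.
I hope it helps ;-)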
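If you would rather not depend on zoo for the last step, dplyr::coalesce() can replace the missing counts and keeps the column as integer. This is a sketch of the same prepare-then-join idea on hypothetical toy data (grid and observed are made-up names), not the exact pipeline above:

```r
library(dplyr)

# The prepared structure: every combination we want in the result
grid <- expand.grid(
  Data_source = c("1", "2"),
  value = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical observed counts; the combination ("2", "No") never occurs
observed <- data.frame(
  Data_source = c("1", "1", "2"),
  value = c("Yes", "No", "Yes"),
  count = c(5L, 1L, 2L),
  stringsAsFactors = FALSE
)

filled <- grid %>%
  left_join(observed, by = c("Data_source", "value")) %>%
  # combinations without a match come back as NA; turn them into 0
  mutate(count = coalesce(count, 0L)) %>%
  arrange(Data_source)
```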