Question

我有一个看起来像这样的数据集

ID        Q1 Q2 Q3
Person1   A  C  NA
Person2   B  C  D
Person3   A  C  A

实质上，它是对一组多项选择题的回答的表。

我一直在尝试找到一种方法，可以在R中为每个人生成响应配置文件。

最终输出看起来像：

           A    B    C   D   NA
Person1   .33   0  .33   0  .33
Person2    0   .33 .33  .33  0
Person3   .66   0  .33   0   0

我已经尝试过使用crosstab（）函数以及使用dplyr和tidyr移动内容的各种方式。我还用Google搜索了“ R频率表”的每个变体，都没有太大的成功。

我错过了一些非常明显的方法吗？

Answer 1

这里是tidyverse的一种方式-

df %>% 
  gather(var, value, -ID) %>% 
  replace_na(list(value = "Missing")) %>% 
  count(ID, value) %>% 
  group_by(ID) %>% 
  mutate(
    prop = n/sum(n)
  ) %>% 
  select(-n) %>% 
  spread(value, prop, fill = 0)

# A tibble: 3 x 6
# Groups:   ID [3]
  ID          A     B     C     D Missing
  <chr>   <dbl> <dbl> <dbl> <dbl>   <dbl>
1 Person1 0.333 0     0.333 0       0.333
2 Person2 0     0.333 0.333 0.333   0    
3 Person3 0.667 0     0.333 0       0

数据-

df <- read.table(text = "ID Q1 Q2 Q3
Person1 A C NA
Person2 B C D
Person3 A C A", header = T, sep = " ", stringsAsFactors = F)

Answer 2

这与Shree相似，只是带有步骤注释

library(tidyverse)

df <-
  tibble(
    ID = paste0("Person", 1:3),
    Q1 = c("A", "B", "A"),
    Q2 = rep("C", 3),
    Q3 = c(NA, "D", "A")
  )

df %>% 
  # this will flip the data from wide to long
  # and create 2 new columns "var" and "letter"
  # using all the columns not = ID
  gather(key = var, value = letter, -ID) %>%
  # count how many 
  group_by(ID) %>% 
  mutate(total = n()) %>% 
  ungroup() %>% 
  # groups by ID & letter & counts, creates a column "n" 
  # can also use a group by
  count(ID, letter, total) %>% 
  # do the math
  mutate(pct = round(n/total, 2)) %>% 
  # keep just these 3 columns
  select(ID, letter, pct) %>% 
  # the inverse of gather(). Will take the letter column to
  # make new columns for each unique value and will put the 
  # pct values underneath them. Any NA will become a 0
  spread(key = letter, value = pct, fill = 0)

#  ID          A     B     C     D `<NA>`
#  <chr>   <dbl> <dbl> <dbl> <dbl>  <dbl>
# Person1  0.33  0     0.33  0      0.33
# Person2  0     0.33  0.33  0.33   0   
# Person3  0.67  0     0.33  0      0

Answer 3

我先使用melt，然后使用table + prop.table

s=reshape2::melt(df,id.vars='ID')
s[is.na(s)]='NA'
prop.table(table(s$ID,as.character(s$value)),1)

                  A         B         C         D        NA
  Person1 0.3333333 0.0000000 0.3333333 0.0000000 0.3333333
  Person2 0.0000000 0.3333333 0.3333333 0.3333333 0.0000000
  Person3 0.6666667 0.0000000 0.3333333 0.0000000 0.0000000

生成R中所有观测值的多个类别变量水平的频率表

3 个答案: