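Counting occurrences of strings in R under different conditions

Time: 2014-07-15 11:06:10

Tags: r aggregate reshape

I have a data frame named "data" with several columns, which looks like this: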

Preferences  Status      Gender  
8a 8b 9a     Employed    Female  
10b 11c 9b   Unemployed  Male  
11a 11c 8e   Student     Female  

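That is, each customer chose 3 preferences and also supplied other information, such as status and gender. Each preference is given as a [number][letter] combination, and there are about 30 possible preferences. The possible preferences are: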

8[a - c]  
9[a - k]  
10[a - d]  
11[a - c]  
12[a - i]  

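I would like to count the number of occurrences of each preference under certain conditions on the other columns, e.g. for all females.

Ideally, the output would be a data frame that looks like this: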

Preference   Female  Male  Employed  Unemployed  Student
8a           1034    934   234       495         203
8b           539     239   609       394         235
8c           124     395   684       94          283
9a           120     999   895       945         345
9b           978     385   596       923         986

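etc.

What is the most efficient way to achieve this? Thanks.

1 answer:

Answer 0 (score: 2)

I'm assuming that you're starting with something that looks like this: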

mydf <- structure(list(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"), 
  Status = c("Employed", "Unemployed", "Student"), 
  Gender = c("Female", "Male", "Female")), 
  .Names = c("Preferences", "Status", "Gender"), 
  class = c("data.frame"), row.names = c(NA, -3L))
mydf
#   Preferences     Status Gender
# 1    8a 8b 9a   Employed Female
# 2  10b 11c 9b Unemployed   Male
# 3  11a 11c 8e    Student Female

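If so, you need to "split" the "Preferences" column (by spaces), convert the data to a "long" form, and then reshape it to a wide form, tabulating as you go.

With the right tools, this is pretty straightforward.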

library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit` (also available in the "splitstackshape" package)

dcast.data.table(                                # Step 3--aggregate to wide form
  melt(                                          # Step 2--convert to long form
    cSplit(mydf, "Preferences", " ", "long"),    # Step 1--split "Preferences"
    id.vars = "Preferences"), 
  Preferences ~ value, fun.aggregate = length)
#    Preferences Employed Female Male Student Unemployed
# 1:         10b        0      0    1       0          1
# 2:         11a        0      1    0       1          0
# 3:         11c        0      1    1       1          1
# 4:          8a        1      1    0       0          0
# 5:          8b        1      1    0       0          0
# 6:          8e        0      1    0       1          0
# 7:          9a        1      1    0       0          0
# 8:          9b        0      0    1       0          1
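
For readers who would rather not install extra packages, the same three steps (split, go long, tabulate to wide) can be sketched in base R. This is only an illustration, not part of the approach above: it assumes the preferences are always separated by a single space, and the names `prefs`, `long`, and `counts` are made up for the example.

```r
# Same toy data as above, rebuilt so the sketch is self-contained
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split each "Preferences" string on spaces
prefs <- strsplit(mydf$Preferences, " ", fixed = TRUE)

# Step 2: long form, one row per (customer, preference) pair
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(mydf$Status, lengths(prefs)),
  Gender = rep(mydf$Gender, lengths(prefs)),
  stringsAsFactors = FALSE
)

# Step 3: cross-tabulate against each condition column and bind side by side
# (both tables share the same row order because they use the same Preference vector)
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

The result is a matrix with one row per preference; `as.data.frame.matrix(counts)` turns it into a data frame if that is what you need.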

I also tried a dplyr + tidyr approach, which looks like this:

library(dplyr)
library(tidyr)

mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
  gather(Pref, Pvals, P_1:P_3) %>%      # stack the preference columns
  gather(Var, Val, Status:Gender) %>%   # stack the status/gender columns
  group_by(Pvals, Val) %>%              # group by these new columns
  summarise(count = n()) %>%            # aggregate the numbers of each
  spread(Val, count)                    # spread the values out
# Source: local data table [8 x 6]
# Groups: 
# 
#   Pvals Employed Female Male Student Unemployed
# 1   10b       NA     NA    1      NA          1
# 2   11a       NA      1   NA       1         NA
# 3   11c       NA      1    1       1          1
# 4    8a        1      1   NA      NA         NA
# 5    8b        1      1   NA      NA         NA
# 6    8e       NA      1   NA       1         NA
# 7    9a        1      1   NA      NA         NA
# 8    9b       NA     NA    1      NA          1
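
The `NA` cells in this output are combinations that never occur in the data. If zeros are preferred, as in the data.table output, one option is the `fill` argument of `spread()`. The sketch below repeats the pipeline on the toy data with that one change; the `ungroup()` call and the name `res` are additions for the example, and the exact printed header depends on your dplyr version.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, rebuilt so the sketch is self-contained
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

res <- mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>%  # split the preferences
  gather(Pref, Pvals, P_1:P_3) %>%                   # stack the preference columns
  gather(Var, Val, Status:Gender) %>%                # stack the status/gender columns
  group_by(Pvals, Val) %>%
  summarise(count = n()) %>%                         # count each (preference, value) pair
  ungroup() %>%
  spread(Val, count, fill = 0)                       # missing combinations become 0
res
```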

Both approaches are actually pretty quick. Test them with some better sample data than what you shared, like this:

preferences <- c(paste0(8, letters[1:3]),
                 paste0(9, letters[1:11]),
                 paste0(10, letters[1:4]),
                 paste0(11, letters[1:3]),
                 paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000

mydf <- data.frame(
  Preferences = vapply(replicate(nrow, 
                                 sample(preferences, 3, FALSE), 
                                 FALSE), 
                       function(x) paste(x, collapse = " "), 
                       character(1L)),
  Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
  Gender = sample(c("Male", "Female"), nrow, TRUE)
)