I have a data frame called "data" with several columns, like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
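That is, each customer chose 3 preferences and supplied some other information, such as status and gender. Each preference is given as a [number][letter] combination, and there are about 30 possible preferences. The possible preferences are: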
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
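I want to count the number of occurrences of each preference subject to conditions on the other columns, e.g. for all females.
Ideally, the output would be a data frame that looks like this: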
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What is the most efficient way to achieve this? Thanks.
Answer 0 (score: 2)
I'm assuming you're starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
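If so, you need to "split" the "Preferences" column (on spaces), convert the data to "long" form, and then reshape it to wide form, tabulating as you go.
With the right tools, this is pretty straightforward.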
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit` (also available in the splitstackshape package)
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
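For anyone who would rather not source a gist, the same three steps (split, go long, tabulate) can be sketched in base R. This is not the approach used above, just a minimal illustration under the assumption that the columns are plain character vectors; the names `prefs`, `long` and `counts` are my own.

```r
# Sample data from the question (kept as character, not factor)
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE)

# Step 1: split "Preferences" on spaces (one character vector per row)
prefs <- strsplit(mydf$Preferences, " ", fixed = TRUE)

# Step 2: long form, repeating each row's Status/Gender once per preference
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(mydf$Status, lengths(prefs)),
  Gender = rep(mydf$Gender, lengths(prefs)),
  stringsAsFactors = FALSE)

# Step 3: cross-tabulate against each condition column, then bind side by side
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

The result is a matrix with one row per preference; `as.data.frame(counts)` turns it into a data frame if that is what you need downstream.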
I also tried a dplyr + tidyr approach, which looks like this:
library(dplyr)
library(tidyr)
mydf %>%
separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
gather(Pref, Pvals, P_1:P_3) %>% # stack the preference columns
gather(Var, Val, Status:Gender) %>% # stack the status/gender columns
group_by(Pvals, Val) %>% # group by these new columns
summarise(count = n()) %>% # aggregate the numbers of each
spread(Val, count) # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
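Note that this version leaves `NA` where a combination never occurs, whereas the data.table version gives 0. If you want zeros, `spread()` takes a `fill` argument. As a hedged sketch, the same pipeline can also be written with the newer tidyr verbs (`separate_rows()`, `pivot_longer()`, `pivot_wider()`); I'm assuming a reasonably recent tidyr here, since older releases may want `values_fill` as a named list, and the name `wide` is my own.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE)

wide <- mydf %>%
  separate_rows(Preferences, sep = " ") %>%          # one row per preference
  pivot_longer(c(Status, Gender),
               names_to = "Var", values_to = "Val") %>% # stack status/gender
  count(Preferences, Val) %>%                        # occurrences per pair
  pivot_wider(names_from = Val, values_from = n,
              values_fill = 0)                       # 0 instead of NA
wide
```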
Both approaches are actually pretty quick. Test with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
paste0(9, letters[1:11]),
paste0(10, letters[1:4]),
paste0(11, letters[1:3]),
paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
Preferences = vapply(replicate(nrow,
sample(preferences, 3, FALSE),
FALSE),
function(x) paste(x, collapse = " "),
character(1L)),
Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
Gender = sample(c("Male", "Female"), nrow, TRUE)
)
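To get a rough feel for speed on data of this size, you can wrap whichever approach you prefer in `system.time()`. The sketch below is only an illustration, not a proper benchmark: it regenerates comparable data so that it runs on its own, and it times a base R tabulation rather than either of the approaches above. The names `bigdf` and `count_prefs` are hypothetical.

```r
set.seed(1)
n <- 10000
preferences <- c(paste0(8, letters[1:3]), paste0(9, letters[1:11]),
                 paste0(10, letters[1:4]), paste0(11, letters[1:3]),
                 paste0(12, letters[1:9]))
bigdf <- data.frame(
  Preferences = replicate(n, paste(sample(preferences, 3), collapse = " ")),
  Status = sample(c("Employed", "Unemployed", "Student"), n, TRUE),
  Gender = sample(c("Male", "Female"), n, TRUE),
  stringsAsFactors = FALSE)

# Hypothetical helper: split, expand to long form, and cross-tabulate
count_prefs <- function(df) {
  prefs <- strsplit(df$Preferences, " ", fixed = TRUE)
  reps <- lengths(prefs)
  pref <- unlist(prefs)
  cbind(table(pref, rep(df$Gender, reps)),
        table(pref, rep(df$Status, reps)))
}

timing <- system.time(counts <- count_prefs(bigdf))
timing

# Sanity check: every row contributes exactly 3 preferences per condition
sum(counts[, c("Female", "Male")]) == 3 * n
```

For a real comparison between approaches, something like the microbenchmark package over several runs would be more reliable than a single `system.time()` call.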