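I am new to R, and I need to count the values in a column where they are separated by ":".
The dataset has 4 categories, and I need to count the number of actions per category. Each log_id represents a unique action in a category. If a log_id lists two or more categories, that action counts toward every category listed.
The data looks like this: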
user_id log_id categories
001 1334 Perform:Sport_Well:Com.Tent
001 1323 Com.Tent
001 1212 Active
002 1113 NA
002 1478 Com.Tent:Active
002 1134 Sport_Well:Perform
002 1256 Perform
002 1590 Perform
002 1345 NA
002 1478 Com.Tent
002 1134 Sport_Well:Perform
002 1256 Perform
003 1590 Perform
003 1345 Active:Perform
003 1190 Perform:Com.Tent
003 1239 Active:Perform
Here is the dput output:
dat <- structure(list(user_id = c("001", "001", "001", "002", "002",
"002", "002", "002", "002", "002", "002", "002", "003", "003",
"003", "003"), log_id = c("1334", "1323", "1212", "1113", "1478",
"1134", "1256", "1590", "1345", "1478", "1134", "1256", "1590",
"1345", "1190", "1239"), categories = c("Perform:Sport_Well:Com.Tent",
"Com.Tent", "Active", NA, "Com.Tent:Active", "Sport_Well:Perform",
"Perform", "Perform", NA, "Com.Tent", "Sport_Well:Perform", "Perform",
"Perform", "Active:Perform", "Perform:Com.Tent", "Active:Perform")),
.Names = c("user_id", "log_id", "categories"), class = "data.frame", row.names = c(NA, -16L))
The desired output is as follows:
user_id category NumActions
001 Perform 1
001 Sport_Well 1
001 Com.Tent 2
001 Active 1
002 Com.Tent 2
002 Active 1
002 Perform 5
002 Sport_Well 2
003 Com.Tent 1
003 Active 2
003 Perform 4
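I am trying to split the categories, but I cannot figure out how to count the log_ids that belong to more than one category.
library(stringr)
# returns a list with one character vector of categories per row
dat$cate <- str_split(string = dat$categories, pattern = ":")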
Answer 0: (score: 3)
dplyr
Here is a dplyr solution:
library(dplyr)
dat %>%
group_by(user_id) %>%
do(strsplit(.$categories, ":") %>%
unlist %>%
table(dnn = "categories") %>%
as.data.frame(responseName = "numActions", stringsAsFactors = FALSE))
which gives:
Source: local data frame [11 x 3]
Groups: user_id
user_id categories numActions
1 001 Active 1
2 001 Com.Tent 2
3 001 Perform 1
4 001 Sport_Well 1
5 002 Active 1
6 002 Com.Tent 2
7 002 Perform 5
8 002 Sport_Well 2
9 003 Active 2
10 003 Com.Tent 1
11 003 Perform 4
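Note that if you don't care about the column names, you can omit dnn=... and responseName=..., and if a warning that can safely be ignored is acceptable, you can also omit stringsAsFactors=...
With those caveats, it can be shortened to: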
dat %>%
group_by(user_id) %>%
do(strsplit(.$categories, ":") %>% unlist %>% table %>% as.data.frame)
data.table
This can be done in a similar way with data.table:
library(data.table)
DT <- data.table(dat)
DT[, as.data.frame(table(unlist(strsplit(categories, ":")), dnn = "categories"),
responseName = "numActions"), by = user_id]
The same shortening as in the last note applies here, with the same caveat that the column names will differ:
DT[, as.data.frame(table(unlist(strsplit(categories, ":")))), by = user_id]
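Answer 1: (score: 2)
Split the strings in the column, add the rows to a temporary data frame, and then count. This example uses dplyr idioms, but if you cannot use dplyr, I'm sure others will post a base R solution: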
library(dplyr)
cats <- strsplit(dat$categories, ":")
tmp <- data.frame(user_id = rep(dat$user_id, sapply(cats, length)), categories = unlist(cats))
tmp %>%
group_by(user_id, categories) %>%
summarise(NumActions=n()) %>%
ungroup
## user_id categories NumActions
## 1 001 Active 1
## 2 001 Com.Tent 2
## 3 001 Perform 1
## 4 001 Sport_Well 1
## 5 002 Active 1
## 6 002 Com.Tent 2
## 7 002 Perform 5
## 8 002 Sport_Well 2
## 9 002 NA 2
## 10 003 Active 2
## 11 003 Com.Tent 1
## 12 003 Perform 4
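The output above includes an NA row for user 002, because strsplit() returns NA for a missing string. As a minimal base R sketch of the same split-then-count idea (not the answerer's code; the names tmp and res and the helper column NumActions are chosen here for illustration), the NA rows can be dropped before counting:

```r
# Rebuild the question's data so the example is self-contained
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  log_id = c("1334", "1323", "1212", "1113", "1478", "1134", "1256", "1590",
             "1345", "1478", "1134", "1256", "1590", "1345", "1190", "1239"),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

# One list element per row; a missing string stays a single NA
cats <- strsplit(dat$categories, ":", fixed = TRUE)

# Repeat each user_id once per category found in its row
tmp <- data.frame(user_id = rep(dat$user_id, lengths(cats)),
                  category = unlist(cats),
                  stringsAsFactors = FALSE)

# Drop rows that came from missing categories
tmp <- tmp[!is.na(tmp$category), ]

# Count rows per user and category by summing a column of ones
tmp$NumActions <- 1L
res <- aggregate(NumActions ~ user_id + category, data = tmp, FUN = sum)
res <- res[order(res$user_id, res$category), ]
print(res)
```

With the sample data this should leave 11 rows, one per user and category that actually occurs.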
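Answer 2: (score: 2)
I have been playing with tidyr today, so here is a solution using that package.
First I use separate to split the combined column into three. Then I use gather to reshape the resulting dataset into long format (dropping missing values). Finally, I use dplyr's group_by and summarise to count the rows in each group.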
library(tidyr)
library(dplyr)
Split the one column into three:
dat %>%
separate(categories, c("a", "b", "c"), sep = ":", extra = "merge")
user_id log_id a b c
1 001 1334 Perform Sport_Well Com.Tent
2 001 1323 Com.Tent <NA> <NA>
3 001 1212 Active <NA> <NA>
4 002 1113 <NA> <NA> <NA>
5 002 1478 Com.Tent Active <NA>
6 002 1134 Sport_Well Perform <NA>
7 002 1256 Perform <NA> <NA>
8 002 1590 Perform <NA> <NA>
9 002 1345 <NA> <NA> <NA>
10 002 1478 Com.Tent <NA> <NA>
11 002 1134 Sport_Well Perform <NA>
12 002 1256 Perform <NA> <NA>
13 003 1590 Perform <NA> <NA>
14 003 1345 Active Perform <NA>
15 003 1190 Perform Com.Tent <NA>
16 003 1239 Active Perform <NA>
Reshape to long format (a single column for category):
dat %>%
separate(categories, c("a", "b", "c"), sep = ":", extra = "merge") %>%
gather(variable, category, a:c, na.rm = TRUE)
user_id log_id variable category
1 001 1334 a Perform
2 001 1323 a Com.Tent
3 001 1212 a Active
4 002 1478 a Com.Tent
5 002 1134 a Sport_Well
6 002 1256 a Perform
7 002 1590 a Perform
...
Then group by user_id and category, and count the number of rows in each group.
dat %>%
separate(categories, c("a", "b", "c"), sep = ":", extra = "merge") %>%
gather(variable, category, a:c, na.rm = TRUE) %>%
group_by(user_id, category) %>%
summarise(NumActions = n())
user_id category NumActions
1 001 Active 1
2 001 Com.Tent 2
3 001 Perform 1
4 001 Sport_Well 1
5 002 Active 1
6 002 Com.Tent 2
7 002 Perform 5
8 002 Sport_Well 2
9 003 Active 2
10 003 Com.Tent 1
11 003 Perform 4
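The separate step above assumes that no row holds more than three categories. A possible alternative, sketched here under the assumption that reasonably recent tidyr and dplyr are installed, is separate_rows(), which splits straight into long format without fixing the number of pieces in advance. The NA rows are filtered out first, and res is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Rebuild the question's data so the example is self-contained
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  log_id = c("1334", "1323", "1212", "1113", "1478", "1134", "1256", "1590",
             "1345", "1478", "1134", "1256", "1590", "1345", "1190", "1239"),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  filter(!is.na(categories)) %>%                    # drop actions with no category
  separate_rows(categories, sep = ":") %>%          # one row per action/category pair
  count(user_id, categories, name = "NumActions")   # rows per user and category

print(res)
```

Because every split piece becomes its own row, count() does the work that group_by and summarise do in the pipeline above.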
Answer 3: (score: 1)
The following base R code gives the same result, but in a different format:
> aa = aggregate(categories~user_id, data=dat, function(x) paste(x,collapse=':'))
> sapply(sapply(split(aa, aa$user_id), function(x) strsplit(x$categories, ':') ), table )
$`001`
Active Com.Tent Perform Sport_Well
1 2 1 1
$`002`
Active Com.Tent Perform Sport_Well
1 2 5 2
$`003`
Active Com.Tent Perform
2 1 4
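If the long layout from the question is needed, a list of per-user tables like the one above can be stacked into a single data frame. The following is a sketch in base R only; it builds the tables with split() and lapply() rather than the aggregate() call above, and the names per_user, counts and long are illustrative:

```r
# Rebuild the question's data so the example is self-contained
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  log_id = c("1334", "1323", "1212", "1113", "1478", "1134", "1256", "1590",
             "1345", "1478", "1134", "1256", "1590", "1345", "1190", "1239"),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

# One character vector of category strings per user
per_user <- split(dat$categories, dat$user_id)

# One frequency table per user, ignoring missing categories
counts <- lapply(per_user, function(x) {
  table(unlist(strsplit(x[!is.na(x)], ":", fixed = TRUE)))
})

# Turn each table into a small data frame and stack them
long <- do.call(rbind, lapply(names(counts), function(id) {
  tab <- counts[[id]]
  data.frame(user_id = rep(id, length(tab)),
             category = as.character(names(tab)),
             NumActions = as.vector(tab),
             stringsAsFactors = FALSE)
}))
print(long)
```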
Answer 4: (score: 1)
You can use my cSplit function together with .N from "data.table", like this:
cSplit(dat, "categories", ":", "long")[, list(NumActions = .N),
by = list(user_id, categories)]
# user_id categories NumActions
# 1: 001 Perform 1
# 2: 001 Sport_Well 1
# 3: 001 Com.Tent 2
# 4: 001 Active 1
# 5: 002 NA 2
# 6: 002 Com.Tent 2
# 7: 002 Active 1
# 8: 002 Sport_Well 2
# 9: 002 Perform 5
# 10: 003 Perform 4
# 11: 003 Active 2
# 12: 003 Com.Tent 1
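Note that this also counts NA, which you may or may not want. If you don't want it, removing those values takes just a simple na.omit. Alternatively, to drop the NA "categories" rows, just append the following to the end of the command above: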
[!is.na(categories)]
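For readers who prefer not to depend on cSplit, roughly the same split-and-count can be sketched with data.table alone. This is an assumption-laden alternative rather than the answerer's method: it filters out NA before splitting, and the names DT and res are illustrative.

```r
library(data.table)

# Rebuild the question's data so the example is self-contained
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  log_id = c("1334", "1323", "1212", "1113", "1478", "1134", "1256", "1590",
             "1345", "1478", "1134", "1256", "1590", "1345", "1190", "1239"),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(dat)

# Step 1: drop NA rows and split into one row per user/category occurrence
# Step 2: count the rows in each user/category group with .N
res <- DT[!is.na(categories),
          .(category = unlist(strsplit(categories, ":", fixed = TRUE))),
          by = user_id][, .(NumActions = .N), by = .(user_id, category)]
print(res)
```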