Count values in a column separated by ":"

Time: 2014-10-12 23:03:28

Tags: r

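I am new to R, and I have to count the values in a column where they are separated by ":".

There are 4 categories in the dataset, and I have to count the number of actions in each category per user. Each log_id represents one unique action in a category. If a log_id has two or more categories, that action counts towards every category listed.

The data looks like this: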

user_id   log_id  categories
  001     1334    Perform:Sport_Well:Com.Tent
  001     1323    Com.Tent
  001     1212    Active
  002     1113    NA
  002     1478    Com.Tent:Active
  002     1134    Sport_Well:Perform
  002     1256    Perform
  002     1590    Perform
  002     1345    NA
  002     1478    Com.Tent
  002     1134    Sport_Well:Perform
  002     1256    Perform
  003     1590    Perform
  003     1345    Active:Perform
  003     1190    Perform:Com.Tent
  003     1239    Active:Perform

Here is the dput:

dat <- structure(list(user_id = c("001", "001", "001", "002", "002", 
  "002", "002", "002", "002", "002", "002", "002", "003", "003", 
  "003", "003"), log_id = c("1334", "1323", "1212", "1113", "1478", 
  "1134", "1256", "1590", "1345", "1478", "1134", "1256", "1590", 
  "1345", "1190", "1239"), categories = c("Perform:Sport_Well:Com.Tent", 
  "Com.Tent", "Active", NA, "Com.Tent:Active", "Sport_Well:Perform", 
  "Perform", "Perform", NA, "Com.Tent", "Sport_Well:Perform", "Perform", 
  "Perform", "Active:Perform", "Perform:Com.Tent", "Active:Perform")), 
  .Names = c("user_id", "log_id", "categories"), class = "data.frame", row.names = c(NA, -16L))

The desired output is as follows:

user_id   category        NumActions
  001     Perform             1
  001     Sport_Well          1
  001     Com.Tent            2
  001     Active              1
  002     Com.Tent            2
  002     Active              1
  002     Perform             5
  002     Sport_Well          2
  003     Com.Tent            1
  003     Active              2
  003     Perform             4

I am trying to split the categories, but I cannot figure out how to count the log_ids that have multiple categories.

df$cate = str_split(string = df$Ch_Category, pattern = ":")

5 answers:

Answer 0 (score: 3)

dplyr: Here is a dplyr solution:

library(dplyr)

dat %>% 
   group_by(user_id) %>% 
   do(strsplit(.$categories, ":") %>% 
        unlist %>% 
        table(dnn = "category") %>% 
        as.data.frame(responseName = "numActions", stringsAsFactors = FALSE))

which gives:

Source: local data frame [11 x 3]
Groups: user_id

   user_id   category numActions
1      001     Active          1
2      001   Com.Tent          2
3      001    Perform          1
4      001 Sport_Well          1
5      002     Active          1
6      002   Com.Tent          2
7      002    Perform          5
8      002 Sport_Well          2
9      003     Active          2
10     003   Com.Tent          1
11     003    Perform          4

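Note that if you do not care about the column names, dnn=... and responseName=... can be omitted, and if a warning that can safely be ignored is acceptable, stringsAsFactors=... can be omitted as well. With those caveats the code shortens to: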

dat %>% 
   group_by(user_id) %>% 
   do(strsplit(.$categories, ":") %>% unlist %>% table %>% as.data.frame)
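To make the trade-off concrete, here is a minimal base R sketch of what the optional arguments change. The input vector x is a made-up toy example, not the question's data, and the calls are written as plain nested calls rather than with the pipe.

```r
# Toy input (hypothetical values, not the question's data)
x <- c("a:b", "a", NA)

# Without dnn/responseName the columns get the default names Var1 and Freq,
# and the first column is a factor because as.data.frame() on a table
# defaults to stringsAsFactors = TRUE.
short <- as.data.frame(table(unlist(strsplit(x, ":"))))

# With the extra arguments the names are explicit and the column is character.
full <- as.data.frame(table(unlist(strsplit(x, ":")), dnn = "category"),
                      responseName = "numActions", stringsAsFactors = FALSE)

names(short)
names(full)
```

The NA element is dropped in both cases, since table() ignores NA by default.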

data.table: This can be done in a similar way with data.table:

library(data.table)
DT <- data.table(dat)
DT[, as.data.frame(table(unlist(strsplit(categories, ":")), dnn = "categories"),
                 responseName = "numActions"), by = user_id]

and the shortened version, with the caveat that the column names will not be the same:

DT[, as.data.frame(table(unlist(strsplit(categories, ":")))), by = user_id]
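As a possible variation (not part of the original answer), the table/as.data.frame round trip can be avoided by splitting into long form first and then counting rows with data.table's .N. This is only a sketch: dat is rebuilt here with the same user_id and categories values as the question's dput, and log_id is left out because it is not needed for counting.

```r
library(data.table)

# Same user_id/categories values as the question's dput (log_id omitted)
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)
DT <- as.data.table(dat)

# Drop NA rows, then produce one row per (user_id, single category)
long <- DT[!is.na(categories),
           .(category = unlist(strsplit(categories, ":", fixed = TRUE))),
           by = user_id]

# Count rows per group with .N and sort for readability
res <- long[, .(numActions = .N), by = .(user_id, category)]
setorder(res, user_id, category)
res
```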

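Answer 1 (score: 2)

Split the strings in the column, expand them into the rows of a temporary data frame, then count. This example uses dplyr idioms, but if you cannot use dplyr, I am sure someone else will post a base R solution: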

library(dplyr)

cats <- strsplit(dat$categories, ":")
tmp <- data.frame(user_id = rep(dat$user_id, sapply(cats, length)), categories = unlist(cats))
tmp %>% 
  group_by(user_id, categories) %>% 
  summarise(NumActions=n()) %>% 
  ungroup

##    user_id categories NumActions
## 1      001     Active          1
## 2      001   Com.Tent          2
## 3      001    Perform          1
## 4      001 Sport_Well          1
## 5      002     Active          1
## 6      002   Com.Tent          2
## 7      002    Perform          5
## 8      002 Sport_Well          2
## 9      002         NA          2
## 10     003     Active          2
## 11     003   Com.Tent          1
## 12     003    Perform          4
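The output above keeps an NA row for user 002. For readers who want the same split, expand and count idea without dplyr and without the NA group, the following is one possible base R sketch. It is not from the original answer; dat is rebuilt with the same user_id and categories values as the question's dput, with log_id omitted.

```r
# Same user_id/categories values as the question's dput (log_id omitted)
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

# One list element per row; an NA input stays a single NA
cats <- strsplit(dat$categories, ":", fixed = TRUE)

# Repeat each user_id once per category found in its row
tmp <- data.frame(user_id = rep(dat$user_id, lengths(cats)),
                  category = unlist(cats),
                  stringsAsFactors = FALSE)
tmp <- tmp[!is.na(tmp$category), ]  # drop the NA rows before counting

# Cross-tabulate, then keep only the combinations that actually occur
res <- as.data.frame(table(user_id = tmp$user_id, category = tmp$category),
                     responseName = "NumActions", stringsAsFactors = FALSE)
res <- res[res$NumActions > 0, ]
res <- res[order(res$user_id, res$category), ]
res
```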

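Answer 2 (score: 2)

I have been playing with tidyr today, so here is a solution using that package.

First I use separate to split the combined column into three. Then I use gather to reshape the resulting dataset into long format (dropping missing values). Finally, I use dplyr's group_by and summarise to count the rows in each group.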

library(tidyr)
library(dplyr)

Separate the one column into three:

dat %>% 
    separate(categories, c("a", "b", "c"), sep = ":", extra = "merge")

   user_id log_id          a          b        c
1      001   1334    Perform Sport_Well Com.Tent
2      001   1323   Com.Tent       <NA>     <NA>
3      001   1212     Active       <NA>     <NA>
4      002   1113       <NA>       <NA>     <NA>
5      002   1478   Com.Tent     Active     <NA>
6      002   1134 Sport_Well    Perform     <NA>
7      002   1256    Perform       <NA>     <NA>
8      002   1590    Perform       <NA>     <NA>
9      002   1345       <NA>       <NA>     <NA>
10     002   1478   Com.Tent       <NA>     <NA>
11     002   1134 Sport_Well    Perform     <NA>
12     002   1256    Perform       <NA>     <NA>
13     003   1590    Perform       <NA>     <NA>
14     003   1345     Active    Perform     <NA>
15     003   1190    Perform   Com.Tent     <NA>
16     003   1239     Active    Perform     <NA>

Reshape to long format (a single column of categories):

dat %>% 
    separate(categories, c("a", "b", "c"), sep = ":", extra = "merge") %>%
    gather(variable, category, a:c, na.rm = TRUE)

   user_id log_id variable   category
1      001   1334        a    Perform
2      001   1323        a   Com.Tent
3      001   1212        a     Active
4      002   1478        a   Com.Tent
5      002   1134        a Sport_Well
6      002   1256        a    Perform
7      002   1590        a    Perform
...

Then group by user_id and category, and count the number of rows in each group.

dat %>% 
    separate(categories, c("a", "b", "c"), sep = ":", extra = "merge") %>%
    gather(variable, category, a:c, na.rm = TRUE) %>%
    group_by(user_id, category) %>%
    summarise(NumActions = n())

   user_id   category NumActions
1      001     Active          1
2      001   Com.Tent          2
3      001    Perform          1
4      001 Sport_Well          1
5      002     Active          1
6      002   Com.Tent          2
7      002    Perform          5
8      002 Sport_Well          2
9      003     Active          2
10     003   Com.Tent          1
11     003    Perform          4
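The separate step above hard-codes three target columns, so it relies on no row having more than three categories. A possible alternative, assuming a tidyr version that provides separate_rows() and a dplyr version whose count() accepts a name argument, goes straight to long format. This sketch is not part of the original answer; dat is rebuilt with the same user_id and categories values as the question's dput, with log_id omitted.

```r
library(tidyr)
library(dplyr)

# Same user_id/categories values as the question's dput (log_id omitted)
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  separate_rows(categories, sep = ":") %>%      # one row per single category
  filter(!is.na(categories)) %>%                # drop rows with no category
  count(user_id, categories, name = "NumActions")
res
```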

Answer 3 (score: 1)

The following base R code gives the same counts, but in a different format:

> aa = aggregate(categories~user_id, data=dat, function(x) paste(x,collapse=':'))
> sapply(sapply(split(aa, aa$user_id), function(x) strsplit(x$categories, ':')  ), table )
$`001`

    Active   Com.Tent    Perform Sport_Well 
         1          2          1          1 

$`002`

    Active   Com.Tent    Perform Sport_Well 
         1          2          5          2 

$`003`

  Active Com.Tent  Perform 
       2        1        4 
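If the long layout from the question is needed, the per-user tables can be stacked into one data frame. The following is only a sketch of one way to do that in base R; the names counts and long are introduced here for illustration, and dat is rebuilt with the same user_id and categories values as the question's dput, with log_id omitted.

```r
# Same user_id/categories values as the question's dput (log_id omitted)
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

# The formula interface drops rows with NA categories by default
aa <- aggregate(categories ~ user_id, data = dat, FUN = paste, collapse = ":")

# One frequency table per user
counts <- lapply(strsplit(aa$categories, ":", fixed = TRUE), table)
names(counts) <- aa$user_id

# Stack the tables into user_id / category / NumActions rows
long <- do.call(rbind, lapply(names(counts), function(u) {
  data.frame(user_id = u,
             category = names(counts[[u]]),
             NumActions = as.vector(counts[[u]]),
             stringsAsFactors = FALSE)
}))
long
```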

Answer 4 (score: 1)

You can use my cSplit function together with .N from "data.table", as follows:

cSplit(dat, "categories", ":", "long")[, list(NumActions = .N), 
                                       by = list(user_id, categories)]
#     user_id categories NumActions
#  1:     001    Perform          1
#  2:     001 Sport_Well          1
#  3:     001   Com.Tent          2
#  4:     001     Active          1
#  5:     002         NA          2
#  6:     002   Com.Tent          2
#  7:     002     Active          1
#  8:     002 Sport_Well          2
#  9:     002    Perform          5
# 10:     003    Perform          4
# 11:     003     Active          2
# 12:     003   Com.Tent          1

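Note that this also counts the NA values, which you may or may not want. If you do not want them, a simple na.omit is all it takes to remove them. Alternatively, to remove only the NA "categories", add the following to the end of the above command: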

[!is.na(categories)]
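Put together, the full command might look like the sketch below. It assumes cSplit is the version exported by the splitstackshape package (the answer itself does not say where the function comes from), and dat is rebuilt with the same user_id and categories values as the question's dput, with log_id omitted.

```r
library(splitstackshape)  # assumption: cSplit() is taken from this package
library(data.table)

# Same user_id/categories values as the question's dput (log_id omitted)
dat <- data.frame(
  user_id = rep(c("001", "002", "003"), times = c(3, 9, 4)),
  categories = c("Perform:Sport_Well:Com.Tent", "Com.Tent", "Active", NA,
                 "Com.Tent:Active", "Sport_Well:Perform", "Perform", "Perform",
                 NA, "Com.Tent", "Sport_Well:Perform", "Perform", "Perform",
                 "Active:Perform", "Perform:Com.Tent", "Active:Perform"),
  stringsAsFactors = FALSE
)

# Split to long form, count rows per group, then drop any NA category group
res <- cSplit(dat, "categories", ":", "long")[
  , list(NumActions = .N), by = list(user_id, categories)][
  !is.na(categories)]
res
```

The trailing filter is harmless if a given cSplit version has already dropped the NA rows.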