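Counting certain text values with dplyr

Asked: 2018-06-23 07:09:13

Tags: r dplyr

In my data frame, I am trying to count certain text values: '000', 'xxx', and everything that is neither of them, i.e. not (000 | xxx).

My data frame looks like this: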

Name    per1    per2    per3    
a1      000      xxx    230    
a1      xxx      000    NA    
a2      000      340    xxx    
a3      000      xxx    NA

Desired result (counts):

Name  000  xxx  Others
a1    2    2    1
a2    1    1    1
a3    1    1    0

Using dplyr: I tried the following but got an error. Please help me achieve this.

df %>% groupby(Name) %>% filter(grepl('000')) %>% summarize(000 = n())

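3 Answers:

Answer 0: (score: 2)

One option is to convert the data to long format and then use reshape2::dcast to get the counts. In this first version only NA is relabeled as Others, so 230 and 340 keep their own columns: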

library(tidyverse)
library(reshape2)

df %>% gather(key, value, -Name) %>%
  mutate(value = ifelse(is.na(value), "Others", value)) %>%
  dcast(Name~value, fun.aggregate = length)
#   Name 000 230 340 Others xxx
# 1   a1   2   1   0      1   2
# 2   a2   1   0   1      0   1
# 3   a3   1   0   0      1   1  

Or, if the OP wants counts for just the 000, xxx, and Others categories (note that NA values are also counted under Others here):

library(tidyverse)
library(reshape2)

df %>% gather(key, value, -Name) %>%
  mutate(value = 
     ifelse(is.na(value) | !(value %in% c("000", "xxx")), "Others", value)) %>%
  dcast(Name~value, fun.aggregate = length)


#   Name 000 Others xxx
# 1   a1   2      2   2
# 2   a2   1      1   1
# 3   a3   1      1   1

Data:

df<-read.table(text="
Name per1 per2 per3
a1 000 xxx 230
a1 xxx 000 NA
a2 000 340 xxx
a3 000 xxx NA",
header=TRUE, stringsAsFactors = FALSE)
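As a hedged alternative to the gather + dcast pipeline above, the same long-then-wide idea can be sketched with tidyr's pivot_longer() and pivot_wider(). This sketch assumes tidyr 1.0 or later and differs from the answer in one respect: NA values are dropped rather than counted, so the counts line up with the desired result in the question. The column names key, value, and n are illustrative choices, and df is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same data as in the answer, built inline so the sketch is self-contained
df <- data.frame(
  Name = c("a1", "a1", "a2", "a3"),
  per1 = c("000", "xxx", "000", "000"),
  per2 = c("xxx", "000", "340", "xxx"),
  per3 = c("230", NA, "xxx", NA),
  stringsAsFactors = FALSE
)

counts <- df %>%
  # long format; NA cells are dropped here instead of being relabeled
  pivot_longer(-Name, names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  # anything that is not one of the two target strings becomes "Others"
  mutate(value = if_else(value %in% c("000", "xxx"), value, "Others")) %>%
  count(Name, value) %>%
  # back to wide; Name/category pairs that never occur are filled with 0
  pivot_wider(names_from = value, values_from = n,
              values_fill = list(n = 0L))

counts
```

Dropping NA with values_drop_na = TRUE is a design choice for this sketch; keep the NA rows and relabel them if they should count as Others, as the answer does.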

Answer 1: (score: 1)

Here are a few tidyverse possibilities, all variations on the same idea:

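library(tidyverse)

# 1) Nest the per* columns, then work row by row.
#    rowwise(Name) keeps Name in the summarized result.
df %>%
  nest(data = -Name) %>%
  rowwise(Name) %>%
  summarize(`000`  = sum(data == '000', na.rm = TRUE),
            xxx    = sum(data == 'xxx', na.rm = TRUE),
            Others = sum(!is.na(data)) - `000` - xxx)

# 2) Same idea with group_by(); data is a list column here, hence data[[1]].
df %>%
  nest(data = -Name) %>%
  group_by(Name) %>%
  summarize(`000`  = sum(data[[1]] == '000', na.rm = TRUE),
            xxx    = sum(data[[1]] == 'xxx', na.rm = TRUE),
            Others = sum(!is.na(data[[1]])) - `000` - xxx)

# 3) No nesting: do() receives each group's rows as `.`, and .[-1] drops Name.
df %>%
  group_by(Name) %>%
  do(tibble(`000`  = sum(.[-1] == '000', na.rm = TRUE),
            xxx    = sum(.[-1] == 'xxx', na.rm = TRUE),
            Others = sum(!is.na(.[-1])) - `000` - xxx)) %>%
  ungroup()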

# # A tibble: 3 x 4
#   Name  `000`   xxx Others
#   <chr> <int> <int>  <int>
# 1 a1        2     2      1
# 2 a2        1     1      1
# 3 a3        1     1      0

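Note that rowwise and grouping by row work slightly differently: with rowwise, data refers directly to the nested data frame of the current row, whereas with group_by it is a one-element list, hence data[[1]].

Here is also a base R translation: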
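To make that difference concrete, here is a minimal sketch, assuming dplyr 1.0 or later and tidyr 1.0 or later. It only checks what data looks like inside each kind of grouping; the helper column names is_df and inner_is_df are made up for the illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Name = c("a1", "a1", "a2", "a3"),
  per1 = c("000", "xxx", "000", "000"),
  per2 = c("xxx", "000", "340", "xxx"),
  per3 = c("230", NA, "xxx", NA),
  stringsAsFactors = FALSE
)

nested <- df %>% nest(data = -Name)

# rowwise(): the list column is indexed with [[ for each row,
# so `data` is the nested data frame itself
by_row <- nested %>%
  rowwise() %>%
  mutate(is_df = is.data.frame(data)) %>%
  ungroup()

# group_by(): the list column is sliced with [ for each group,
# so `data` is a one-element list and the frame is data[[1]]
by_group <- nested %>%
  group_by(Name) %>%
  mutate(is_df       = is.data.frame(data),
         inner_is_df = is.data.frame(data[[1]])) %>%
  ungroup()

by_row$is_df          # TRUE for every row
by_group$is_df        # FALSE for every group
by_group$inner_is_df  # TRUE for every group
```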

do.call(
  rbind,
  by(df,df$Name,function(x) data.frame(
    Name   = x$Name[1],
    `000`  = sum(x[-1]=='000',na.rm=T),
    xxx    = sum(x[-1]=='xxx',na.rm=T),
    Others = sum(x[-1]!='000' & x[-1]!='xxx',na.rm=T))))

#    Name X000 xxx Others
# a1   a1    2   2      1
# a2   a2    1   1      1
# a3   a3    1   1      0

Answer 2: (score: 1)

If I understand correctly and the task is to count all xxx, 000, and !000 & !xxx values by Name, we can also use base::table() to get the desired output:

df <- data.frame(Name = c("a1", "a1", "a2", "a3"),
                 per1 = c("000", "xxx", "000", "000"), 
                 per2 = c("xxx", "000", 340, "xxx"),
                 per3 = c(230, NA, "xxx", NA),
                 stringsAsFactors = F
                 )

Vals <- unlist(df[,-1])                                       # convert to the vector
Vals[!(Vals %in% c("000", "xxx")) & !is.na(Vals)] <- "Others" # !(xxx|000) <- Others
                                                              #
as.data.frame.matrix(                                         # group by Name, count
   table(rep(df$Name, ncol(df) - 1), Vals, useNA = "no")      # don't count NAs   
   )                                                          # convert to data.frame

#   000 Others xxx
#a1   2      1   2
#a2   1      1   1
#a3   1      0   1
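The table() approach above can be wrapped in a small reusable function. This is only a sketch: count_values and its arguments (id, targets, other) are hypothetical names, not part of any package. It assumes every non-id column holds the values to be counted. Turning the labels into a factor with fixed levels pins the column order and keeps a zero-filled column for a category that never occurs.

```r
# Hypothetical helper: count target strings per id, lumping the rest together
count_values <- function(data, id, targets, other = "Others") {
  value_cols <- setdiff(names(data), id)
  # stack all value columns into one vector, column by column
  vals <- unlist(data[value_cols], use.names = FALSE)
  # repeat the ids once per stacked column so they line up with vals
  ids <- rep(data[[id]], times = length(value_cols))
  # keep NA as NA, keep targets, relabel everything else
  labels <- ifelse(is.na(vals), NA_character_,
                   ifelse(vals %in% targets, vals, other))
  labels <- factor(labels, levels = c(targets, other))
  # cross-tabulate; NA labels are not counted
  as.data.frame.matrix(table(ids, labels, useNA = "no"))
}

df <- data.frame(Name = c("a1", "a1", "a2", "a3"),
                 per1 = c("000", "xxx", "000", "000"),
                 per2 = c("xxx", "000", "340", "xxx"),
                 per3 = c("230", NA, "xxx", NA),
                 stringsAsFactors = FALSE)

res <- count_values(df, id = "Name", targets = c("000", "xxx"))
res
#    000 xxx Others
# a1   2   2      1
# a2   1   1      1
# a3   1   1      0
```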