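R Tidyverse: filtering on multiple conditions

Time: 2020-06-03 17:28:47

Tags: r filter dplyr

How can I tell R (dplyr) to "reset" a filter so that I can filter a second time within the same pipe? Otherwise I would have to write a for loop for each identifier number. The minimal working example below highlights the problem I am facing.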

library(tidyverse)

data.tibble <- tribble(                      # sample data
  ~id,~year, ~identifier, ~items, ~cost,
  10, 2018, "aaca" , 10, 25, # "aaca" toy cars
  20, 2018, "aaca" , 12, 28, # "aaca" toy cars
  10, 2018, "bbda" , 14, 30, # "bbda" pens 
  20, 2018, "bbda" , 27, 29, # "bbda" pens
)

a <- data.tibble %>%                       # first block works fine on its own
  group_by(id, year) %>% 
  filter(str_detect(identifier, "^a")) %>% # keep identifiers that begin with "a"
  summarise(toycars_sold = sum(items),
            toycars_cost = sum(cost)) 
a 

b <- data.tibble %>%                       # Second block works fine on its own
  group_by(id, year) %>% 
  filter(str_detect(identifier,"^b")) %>% 
  summarise(pens_sold=sum(items),
            pens_cost=sum(cost))
b

I run into trouble when I ask dplyr to filter again on a different identifier within the same pipe. I then get an error message:

data.tibble %>% 
  group_by(id, year) %>% 
  filter(str_detect(identifier, "^a")) %>% 
  summarise(toycars_sold=sum(items),
            toycars_cost=sum(cost)) %>% 
  filter(str_detect(identifier,"^b")) %>% 
  summarise(pens_sold=sum(items),
            pens_cost=sum(cost))


What I would like to end up with is

c <- full_join(a,b)

There are a myriad of codes ("identifiers") that I will have to go through (sometimes there is more than one identifier for a single item).

R then tells me that object 'identifier' is not found.
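The error arises because summarise() returns only the grouping columns and the newly computed summaries, so the identifier column no longer exists when the second filter() runs. A minimal sketch on a copy of the sample data (the object names toy and summarised are made up for illustration):

```r
library(tidyverse)

# Hypothetical copy of the question's sample data
toy <- tribble(
  ~id, ~year, ~identifier, ~items, ~cost,
  10, 2018, "aaca", 10, 25,
  20, 2018, "aaca", 12, 28,
  10, 2018, "bbda", 14, 30,
  20, 2018, "bbda", 27, 29
)

summarised <- toy %>%
  group_by(id, year) %>%
  filter(str_detect(identifier, "^a")) %>%
  summarise(toycars_sold = sum(items),
            toycars_cost = sum(cost),
            .groups = "drop")

# Only the grouping columns and the new summaries survive
names(summarised)
#> [1] "id"           "year"         "toycars_sold" "toycars_cost"

# So a later filter() has no `identifier` column left to test
"identifier" %in% names(summarised)
#> [1] FALSE
```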

Any help is much appreciated.

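Old question, somewhat hard to understand

I have a problem that I can't seem to wrap my head around. My question is: after calling the first summarize() function, how do I tell the tidyverse to reset the filter? Otherwise I would have to create a for loop for every "id-code" (I believe regular expression is the correct term) that I want to filter on.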

output <- vector("list") # object to store output in 

for (i in seq_along(object18)) { # list to loop over, here items of stores in year 18 
  output[[i]] <- object18[[i]] %>% 
    group_by(storeid, month, year, quarter) %>%  # variables to group by
    filter(str_detect(itemcode, "^CODE")) %>%    # CODE stands for some identifier (a string)
    summarize(toys = sum(items),
              max.items.sold = max(items)) %>% 
    filter(str_detect(itemcode, "^NEWCODE")) %>% # FILTERING ON A NEW CODE (possibly multiple codes) DOESN'T WORK
    summarize(toys2 = sum(items),
              itemstoy2 = max(items))
}

Does anyone have an idea how to achieve my goal?

Please don't be harsh on me, I'm new to R.

Thanks in advance, David.

1 answer:

Answer 0: (score: 0)

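There is no way to "roll back" a filter and return to the original, unfiltered data within a pipe. It might be possible to implement such a feature, but the tidyverse has better options for producing the same output.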
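One such option, offered here as an assumption and not as part of the original answer, is to skip filter() altogether and subset inside a single summarise(), so that every summary is computed from the unfiltered groups. The object names sample_data and wide are invented for the example.

```r
library(tidyverse)

# Hypothetical copy of the question's sample data
sample_data <- tribble(
  ~id, ~year, ~identifier, ~items, ~cost,
  10, 2018, "aaca", 10, 25,
  20, 2018, "aaca", 12, 28,
  10, 2018, "bbda", 14, 30,
  20, 2018, "bbda", 27, 29
)

wide <- sample_data %>%
  group_by(id, year) %>%
  summarise(
    # Subset each column by the pattern instead of filtering the rows away
    toycars_sold = sum(items[str_detect(identifier, "^a")]),
    toycars_cost = sum(cost[str_detect(identifier, "^a")]),
    pens_sold    = sum(items[str_detect(identifier, "^b")]),
    pens_cost    = sum(cost[str_detect(identifier, "^b")]),
    .groups = "drop"
  )

wide
```

This becomes verbose once there are many identifiers. Note also that a group with no matching rows gets a sum of 0 here, whereas joining separately filtered results would give NA.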

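For this kind of problem I would:

  1. Define a custom function that takes a data.frame and your regex filter (as a string) as arguments and returns the sums of sold and cost.

  2. Define a named vector with the item names as names and the regex filters as values.

  3. Wrap the existing data in a list inside a tibble and cross it with the vector from 2., then add the vector names as a new column.

  4. Apply the custom function defined in 1. with map2 to generate the filtered data sets.

  5. Select the name column and the column containing the filtered data, and unnest the latter.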

Now you have your data in long format. For many tasks this is already a good format. In a final step you can bring it into your desired format by ...

  6. ... using pivot_wider

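If you want to filter on more than just regular expressions, you need to create a list of expressions (instead of a character vector) and use that list of filters instead.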

library(tidyverse)

data.tibble <- tribble(                      # sample data
  ~id,~year, ~identifier, ~items, ~cost,
  10, 2018, "aaca" , 10, 25, # "aaca" toy cars
  20, 2018, "aaca" , 12, 28, # "aaca" toy cars
  10, 2018, "bbda" , 14, 30, # "bbda" pens 
  20, 2018, "bbda" , 27, 29, # "bbda" pens
)

sum_filter <- function(.df, .filter) {

  .df %>% 
    group_by(id, year) %>% 
    filter(str_detect(identifier, .filter)) %>%
    # summarise() returns one row per group even when several rows match
    summarise(sold = sum(items),
              cost = sum(cost),
              .groups = "drop")

}

filter_vec <- c("toycars" = "^a",
                "pens" = "^b")

tibble(data = list(data.tibble)) %>%
  crossing(filters = filter_vec) %>% 
  # crossing() sorts its inputs, so look each name up by its pattern
  mutate(name = names(filter_vec)[match(filters, filter_vec)],
         filtered_data = map2(data, filters, sum_filter)) %>% 
  select(name, filtered_data) %>% 
  unnest(cols = filtered_data) %>% 
  pivot_wider(names_from = name,
              values_from = c(sold, cost))

#> # A tibble: 2 x 6
#>      id  year sold_toycars sold_pens cost_toycars cost_pens
#>   <dbl> <dbl>        <dbl>     <dbl>        <dbl>     <dbl>
#> 1    10  2018           10        14           25        30
#> 2    20  2018           12        27           28        29

Created on 2020-06-04 by the reprex package (v0.3.0)
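The remark about filtering on more than regular expressions could be sketched as follows. This is an illustration under assumptions, not the answer's own code: the filters are stored as a named list of unevaluated expressions built with rlang::exprs() and injected into filter() with !!. The names sum_filter_expr, filter_list, sample_data and long_result, as well as the extra condition on items, are hypothetical.

```r
library(tidyverse)

# Hypothetical copy of the question's sample data
sample_data <- tribble(
  ~id, ~year, ~identifier, ~items, ~cost,
  10, 2018, "aaca", 10, 25,
  20, 2018, "aaca", 12, 28,
  10, 2018, "bbda", 14, 30,
  20, 2018, "bbda", 27, 29
)

# Hypothetical helper: `.filter` is an unevaluated expression, not a string
sum_filter_expr <- function(.df, .filter) {
  .df %>%
    filter(!!.filter) %>%            # inject the stored condition
    group_by(id, year) %>%
    summarise(sold = sum(items),
              cost = sum(cost),
              .groups = "drop")
}

# Named list of arbitrary filter conditions (assumed examples)
filter_list <- rlang::exprs(
  toycars  = str_detect(identifier, "^a"),
  big_pens = str_detect(identifier, "^b") & items > 20
)

# One summarised tibble per filter, stacked in long format with a `name` column
long_result <- map(filter_list, ~ sum_filter_expr(sample_data, .x)) %>%
  bind_rows(.id = "name")

long_result
```

The long result can then be passed to pivot_wider() exactly as in the regex version.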