How to subset on multiple column conditions in R?

Asked: 2019-02-02 13:43:24

Tags: r dataset subset tapply

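My dataset looks like the one shown below. I am trying to answer the following question.

Question:

Based only on the drawing paper data, do the stores sell more units (units.sold column) of one paper type (paper.type) than of the others?

To answer this, I used the tapply function, which sums the units for both papers (watercolor and drawing). Now I am not sure how to proceed to get the data for drawing paper only. Any help is appreciated!

My code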

tapply(df$units.sold,list(df$paper,df$paper.type,df$store),sum)

Dataset

            date year     rep     store      paper paper.type unit.price units.sold total.sale
9991  12/30/2015 2015     Ran    Dublin watercolor      sheet       0.77          5       3.85
9992  12/30/2015 2015     Ran    Dublin    drawing       pads      10.26          1      10.26
9993  12/30/2015 2015  Arijit  Syracuse watercolor        pad      12.15          2      24.30
9994  12/30/2015 2015  Thomas Davenport    drawing       roll      20.99          1      20.99
9995  12/31/2015 2015   Ruisi    Dublin watercolor      sheet       0.77          7       5.39
9996  12/31/2015 2015   Mohit Davenport    drawing       roll      20.99          1      20.99
9997  12/31/2015 2015    Aman  Portland    drawing       pads      10.26          1      10.26
9998  12/31/2015 2015 Barakat  Portland watercolor      block      19.34          1      19.34
9999  12/31/2015 2015  Yunzhu  Syracuse    drawing    journal      24.94          1      24.94
10000 12/31/2015 2015    Aman  Portland watercolor      block      19.34          1      19.34

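Note: I am new to R. Please provide an explanation along with your code.

3 answers:

Answer 0 (score: 3)

Use dplyr from the tidyverse and its filter function. You can chain functions together with the %>% pipe operator.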

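library(dplyr)

df2 <- df %>% 
  filter(paper == "drawing") %>%           # keep only the drawing rows
  group_by(store, paper.type) %>%          # one group per store and paper type
  summarise(units.sold = sum(units.sold))  # total units sold in each group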

  store     paper.type units.sold
  <chr>     <chr>           <dbl>
1 Davenport roll                2
2 Dublin    pads                1
3 Portland  pads                1
4 Syracuse  journal             1
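
Since the title asks about several column conditions at once, here is a minimal sketch of how that could look with filter. It assumes dplyr is installed; the trimmed df, the choice of the Dublin and Portland stores, and the name drawing_two_stores are illustrative only and do not come from the original answer.

```r
library(dplyr)

# Trimmed copy of the question's data: only the columns used here.
df <- data.frame(
  store = c("Dublin", "Dublin", "Syracuse", "Davenport", "Dublin",
            "Davenport", "Portland", "Portland", "Syracuse", "Portland"),
  paper = c("watercolor", "drawing", "watercolor", "drawing", "watercolor",
            "drawing", "drawing", "watercolor", "drawing", "watercolor"),
  paper.type = c("sheet", "pads", "pad", "roll", "sheet",
                 "roll", "pads", "block", "journal", "block"),
  units.sold = c(5L, 1L, 2L, 1L, 7L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Conditions separated by commas inside filter() are combined with AND;
# %in% tests membership in a set of values (hypothetical store selection).
drawing_two_stores <- df %>%
  filter(paper == "drawing", store %in% c("Dublin", "Portland")) %>%
  group_by(store, paper.type) %>%
  summarise(units.sold = sum(units.sold))

print(drawing_two_stores)
```

Writing filter(paper == "drawing" & store %in% c("Dublin", "Portland")) should behave the same way; the comma form is just easier to read when the list of conditions grows.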

Answer 1 (score: 1)

You can use aggregate to sum units.sold by the store and paper.type columns:

aggregate(units.sold~store+paper.type, df[df$paper == "drawing", ], sum)

#      store paper.type units.sold
#1  Syracuse    journal          1
#2    Dublin       pads          1
#3  Portland       pads          1
#4 Davenport       roll          2

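Here, we first filter the data to keep only the rows where paper is "drawing". From this output, we can compare the number of units.sold for each paper.type in each store.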
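
Because the question started from tapply, the same subset-then-sum idea can also be sketched with tapply itself. This is only an illustration on a trimmed copy of the question's data; the names drawing, units_by_store_type and totals_by_type are made up for this example.

```r
# Trimmed copy of the question's data: only the columns used here.
df <- data.frame(
  store = c("Dublin", "Dublin", "Syracuse", "Davenport", "Dublin",
            "Davenport", "Portland", "Portland", "Syracuse", "Portland"),
  paper = c("watercolor", "drawing", "watercolor", "drawing", "watercolor",
            "drawing", "drawing", "watercolor", "drawing", "watercolor"),
  paper.type = c("sheet", "pads", "pad", "roll", "sheet",
                 "roll", "pads", "block", "journal", "block"),
  units.sold = c(5L, 1L, 2L, 1L, 7L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Step 1: keep only the drawing rows (logical row index, all columns).
drawing <- df[df$paper == "drawing", ]

# Step 2: sum units.sold for every store x paper.type combination.
# Combinations that never occur in the data come back as NA.
units_by_store_type <- with(drawing, tapply(units.sold, list(store, paper.type), sum))
print(units_by_store_type)

# Step 3: total per paper type across all stores, ignoring the NA cells.
totals_by_type <- colSums(units_by_store_type, na.rm = TRUE)
print(totals_by_type)
```

On this ten-row sample the totals come out as 1 for journal and 2 each for pads and roll, which agrees with the aggregate output above. If the character columns were factors instead, the unused watercolor levels might show up as all-NA columns unless they are dropped with droplevels first.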

Answer 2 (score: 1)

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT, specify the i expression (paper == 'drawing') to subset the rows, group by 'store' and 'paper.type', and summarise by taking the sum of 'units.sold'.

library(data.table)
setDT(df)[paper == "drawing", .(units.sold = sum(units.sold)), .(store, paper.type)]
#       store paper.type units.sold
#1:    Dublin       pads          1
#2: Davenport       roll          2
#3:  Portland       pads          1
#4:  Syracuse    journal          1
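
To make the three parts of the call easier to see, here is a hedged sketch that spells out by = explicitly and chains a second bracket to sort the result. It assumes data.table is installed and uses as.data.table on a trimmed copy of the data so the original df is not modified in place; the names dt and res are illustrative.

```r
library(data.table)

# Trimmed copy of the question's data: only the columns used here.
df <- data.frame(
  store = c("Dublin", "Dublin", "Syracuse", "Davenport", "Dublin",
            "Davenport", "Portland", "Portland", "Syracuse", "Portland"),
  paper = c("watercolor", "drawing", "watercolor", "drawing", "watercolor",
            "drawing", "drawing", "watercolor", "drawing", "watercolor"),
  paper.type = c("sheet", "pads", "pad", "roll", "sheet",
                 "roll", "pads", "block", "journal", "block"),
  units.sold = c(5L, 1L, 2L, 1L, 7L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)  # makes a copy, unlike setDT()

res <- dt[
  paper == "drawing",                 # i:  which rows to keep
  .(units.sold = sum(units.sold)),    # j:  what to compute
  by = .(store, paper.type)           # by: the grouping columns
][order(-units.sold)]                 # chained call: largest totals first

print(res)
```

With the sample rows, the Davenport roll group (2 units) sorts to the top, followed by the three single-unit groups.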

Data

df <-  structure(list(date = c("12/30/2015", "12/30/2015", "12/30/2015", 
"12/30/2015", "12/31/2015", "12/31/2015", "12/31/2015", "12/31/2015", 
"12/31/2015", "12/31/2015"), year = c(2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L), rep = c("Ran", "Ran", 
"Arijit", "Thomas", "Ruisi", "Mohit", "Aman", "Barakat", "Yunzhu", 
"Aman"), store = c("Dublin", "Dublin", "Syracuse", "Davenport", 
"Dublin", "Davenport", "Portland", "Portland", "Syracuse", "Portland"
), paper = c("watercolor", "drawing", "watercolor", "drawing", 
"watercolor", "drawing", "drawing", "watercolor", "drawing", 
"watercolor"), paper.type = c("sheet", "pads", "pad", "roll", 
"sheet", "roll", "pads", "block", "journal", "block"), unit.price = c(0.77, 
10.26, 12.15, 20.99, 0.77, 20.99, 10.26, 19.34, 24.94, 19.34), 
    units.sold = c(5L, 1L, 2L, 1L, 7L, 1L, 1L, 1L, 1L, 1L), total.sale = c(3.85, 
    10.26, 24.3, 20.99, 5.39, 20.99, 10.26, 19.34, 24.94, 19.34
    )), class = "data.frame", row.names = c("9991", "9992", "9993", 
"9994", "9995", "9996", "9997", "9998", "9999", "10000"))