Summarise multiple columns in R (while keeping the filter)

Time: 2017-06-26 21:55:36

Tags: r dplyr plyr

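I have hit a bit of a brick wall with the code below. Essentially, dftable should be a filtered data frame containing the widget clicks (I loop over the columns, one per widget).

I then want the sum of all page views for the rows where the widget was active (it is not on every page, so I filter on the widget column to exclude the rows where it is NA). However, dfviews just returns the sum of all page views, not only those where the widget is not NA.

Any guidance would be much appreciated. Sample of mixpanelData: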

     ---------------------------------------------------------------------
     | Group | Date       | WidgetClick | Widget2Click | ViewedPageResult |
     ---------------------------------------------------------------------
     | ABC   | 01/01/2017 | 123456      | NA           | 1450544          |
     ---------------------------------------------------------------------
     | ABN   | 01/01/2017 | NA          | 1245         | 4560000          |
     ---------------------------------------------------------------------
     | ABN   | 01/02/2017 | NA          | 1205         | 4561022          |
     ---------------------------------------------------------------------
     | BNN   | 01/02/2017 | 1044        | NA           | 4561021          |
     ---------------------------------------------------------------------

My ideal output would be the following (plus the proportions, which I can handle fine):

     WidgetClick CSV
     -----------------------------------------------
     | Date       | WidgetClick | ViewedPageResult |
     -----------------------------------------------
     | 01/01/2017 | 123456      | 1450544          |
     -----------------------------------------------
     | 01/02/2017 | 1044        | 4561021          |
     -----------------------------------------------

     WidgetClick 2 CSV
     ------------------------------------------------
     | Date       | Widget2Click | ViewedPageResult |
     ------------------------------------------------
     | 01/01/2017 | 1245         | 4560000          |
     ------------------------------------------------
     | 01/02/2017 | 1205         | 4561022          |
     ------------------------------------------------

The code is provided below:

library(dplyr)
library(lazyeval)  # provides interp()

vars <- colnames(mixpanelData)
vars <- vars[-c(1, 2)]  # drop Group and Date
k <- 1
for (v in vars) {
    filename <- paste(v, k, ".csv", sep = "")
    dftable <- mixpanelData %>%
        filter(!is.na(v)) %>%
        group_by(Date) %>%
        summarise_(clicksum = interp(~sum(var, na.rm = TRUE), var = as.name(v)))

    # Intended to keep only the rows where widget v is not NA,
    # but every row comes through (the problem described above).
    dfviews <- mixpanelData %>%
        filter(!is.na(v)) %>%
        group_by(Date) %>%
        summarise(viewsum = sum(ViewedPageResult))
    total <- merge(dftable, dfviews, by = "Date")
    total <- mutate(total, proportion = clicksum / viewsum * 100)
    write.csv(total, file = filename, row.names = FALSE, na = "")
    k <- k + 1
}

1 answer:

Answer 0 (score: 0)

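In your desired result you show two separate tables. But you also mention that you have several widgets, so separate tables may not be ideal. I will first show how to get the separate tables, and then how to compute all widgets in one go.

Separate tables

Using dplyr and tidyr, you can use filter to get your two tables: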

library(dplyr);library(tidyr)
df <- read.table(text="Group  Date    WidgetClick  Widget2Click  ViewedPageResult
ABC   01/01/2017     123456       NA            1450544
ABN   01/01/2017     NA           1245          4560000
ABN   01/02/2017     NA           1205          4561022
BNN   01/02/2017     1044         NA            4561021",header=TRUE,
stringsAsFactors=FALSE)

df%>% filter(!is.na(WidgetClick)) %>% select(-Widget2Click)
  Group       Date WidgetClick ViewedPageResult
1   ABC 01/01/2017      123456          1450544
2   BNN 01/02/2017        1044          4561021

df%>% filter(!is.na(Widget2Click)) %>% select(-WidgetClick)
  Group       Date Widget2Click ViewedPageResult
1   ABN 01/01/2017         1245          4560000
2   ABN 01/02/2017         1205          4561022
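
The same filtering can be driven by column names held in strings, which is what the loop in the question attempts. There, filter(!is.na(v)) tests the string v itself (for example "WidgetClick"), which is never NA, so no rows are dropped. The following is a minimal sketch, assuming dplyr 0.7 or later so that the .data pronoun is available. summarise_widget is a hypothetical helper name, and df is rebuilt so the snippet runs on its own:

```r
library(dplyr)

df <- read.table(text = "Group Date WidgetClick Widget2Click ViewedPageResult
ABC 01/01/2017 123456 NA 1450544
ABN 01/01/2017 NA 1245 4560000
ABN 01/02/2017 NA 1205 4561022
BNN 01/02/2017 1044 NA 4561021", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: per-date click and view totals for one widget column,
# using only the rows where that widget's column is not NA.
summarise_widget <- function(data, v) {
  data %>%
    filter(!is.na(.data[[v]])) %>%   # look the column up by its name
    group_by(Date) %>%
    summarise(clicksum = sum(.data[[v]]),
              viewsum  = sum(ViewedPageResult)) %>%
    mutate(proportion = clicksum / viewsum * 100)
}

# Every column that is not an identifier or the views column is a widget.
widget_cols <- setdiff(names(df), c("Group", "Date", "ViewedPageResult"))
results <- lapply(setNames(widget_cols, widget_cols),
                  function(v) summarise_widget(df, v))
```

Each element of results (results$WidgetClick, results$Widget2Click) can then be passed to write.csv() as in the question.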

Single table

To get all results in a single table, you first need to gather the Widget*Click columns and then filter:

df%>%
  gather(Widget_number,Click,starts_with("Widget"))%>%
  filter(!is.na(Click)) 

  Group       Date ViewedPageResult Widget_number  Click
1   ABC 01/01/2017          1450544   WidgetClick 123456
2   BNN 01/02/2017          4561021   WidgetClick   1044
3   ABN 01/01/2017          4560000  Widget2Click   1245
4   ABN 01/02/2017          4561022  Widget2Click   1205
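
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a sketch, assuming a recent tidyr is installed, the same reshaping can be written as follows. values_drop_na = TRUE takes the place of the separate filter() step, and the row order may differ from the gather() output above. The name long is only illustrative:

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "Group Date WidgetClick Widget2Click ViewedPageResult
ABC 01/01/2017 123456 NA 1450544
ABN 01/01/2017 NA 1245 4560000
ABN 01/02/2017 NA 1205 4561022
BNN 01/02/2017 1044 NA 4561021", header = TRUE, stringsAsFactors = FALSE)

# Stack the Widget*Click columns into name/value pairs and
# drop the rows where the widget was not present (NA).
long <- df %>%
  pivot_longer(starts_with("Widget"),
               names_to = "Widget_number",
               values_to = "Click",
               values_drop_na = TRUE)
```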

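Edit

To summarise the number of clicks per widget per month, you can use mutate to add a Year_month column with as.yearmon from the zoo package. Then group_by Widget_number and Year_month, and summarise to get the total clicks per month. You can perform other calculations in the summarise statement, such as proportions. I assume the dates are in "%m/%d/%Y" format. Make sure that is indeed the case.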

library(zoo)
df%>%
  gather(Widget_number,Click,starts_with("Widget"))%>%
  filter(!is.na(Click)) %>%
  mutate(Year_month=as.yearmon(as.Date(Date,"%m/%d/%Y"))) %>%
  group_by(Widget_number,Year_month) %>%
  summarise(Sum_clicks=sum(Click,na.rm=TRUE))

  Widget_number    Year_month Sum_clicks
          <chr> <S3: yearmon>      <int>
1  Widget2Click      Jan 2017       2450
2   WidgetClick      Jan 2017     124500
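
As a sketch of the other calculations mentioned above, the page views can be summed alongside the clicks and a proportion derived in the same summarise() call. The column names Sum_views and proportion are illustrative choices, not taken from the question. The stopifnot() line checks the assumed date format, since as.Date() returns NA for strings it cannot parse with the given format:

```r
library(dplyr)
library(tidyr)
library(zoo)

df <- read.table(text = "Group Date WidgetClick Widget2Click ViewedPageResult
ABC 01/01/2017 123456 NA 1450544
ABN 01/01/2017 NA 1245 4560000
ABN 01/02/2017 NA 1205 4561022
BNN 01/02/2017 1044 NA 4561021", header = TRUE, stringsAsFactors = FALSE)

# Stop early if any date cannot be parsed with the assumed format.
parsed <- as.Date(df$Date, "%m/%d/%Y")
stopifnot(!anyNA(parsed))

monthly <- df %>%
  gather(Widget_number, Click, starts_with("Widget")) %>%
  filter(!is.na(Click)) %>%
  mutate(Year_month = as.yearmon(as.Date(Date, "%m/%d/%Y"))) %>%
  group_by(Widget_number, Year_month) %>%
  summarise(Sum_clicks = sum(Click),
            Sum_views  = sum(ViewedPageResult),       # views on rows where the widget was present
            proportion = Sum_clicks / Sum_views * 100)
```

Because the NA rows are filtered out before grouping, Sum_views only counts page views from rows where the widget in question was present.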