Reshaping a data frame for consecutive years

Asked: 2014-03-10 13:50:36

Tags: r reshape riverplot

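I have data on thousands of customers who visited stores over the past 3 years. For each customer, I have:

  • an ID
  • a combination of the year and the first store visited in that year.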
Customer_Id | Year_*_Store 
1            2010_A
1            2011_B
1            2012_C
2            2010_A
2            2011_B
2            2012_D

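What I would like is the following data structure, so that I can visualize how customer behavior evolves with a river plot (also known as a Sankey diagram).

For example, the 2 customers who first visited store A in 2010 first visited store B in 2011: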

SOURCE |     TARGET |   NB_CUSTOMERS
2010_A      2011_B      2
2011_B      2012_C      1
2011_B      2012_D      1

I do not want links between non-consecutive years, such as 2010_A and 2012_D.

How can I do this in R?

2 answers:

Answer 0 (score: 2)

I would do this with dplyr (faster):

df<-read.table(header=T,text="Customer_Id  Year_Store 
1            2010_A
1            2011_B
1            2012_C
2            2010_A
2            2011_B
2            2012_D")

require(dplyr)             # for aggregation
require(riverplot)         # for Sankey

targets <- df %>%
  group_by(Customer_Id) %>%                                   # group by customer
  mutate(source = as.character(Year_Store),
         target = c(as.character(Year_Store)[-1], NA)) %>%    # shift by one row to pair each visit with the next
  filter(!is.na(target)) %>%                                  # drop each customer's last row (no next visit)
  group_by(source, target) %>%                                # regroup by source & target
  summarise(len = n()) %>%                                    # count customers per relationship
  ungroup() %>%
  mutate(step = as.integer(substr(target, 1, 4)) - as.integer(substr(source, 1, 4))) %>%   # years between source and target
  filter(step == 1)                                           # keep consecutive years only

topnodes <- c(as.character(unique(df$Year_Store)))                         # unique nodes

nodes <- data.frame( ID=topnodes,                                          # IDs
                   x=as.numeric(substr(topnodes,1,4)),                     # x value for plot
                   col= rainbow(length(topnodes)),                         # color each different
                   labels= topnodes,                                       # labels
                   stringsAsFactors= FALSE )

edges<-                                                                    # create list of list 
  lapply(unique(targets$source),function(x){
      l<-as.list(filter(targets,source==x)$len)                            # targets per source
      names(l)<-filter(targets,source==x)$target                           # name of target
      l
  })

names(edges)<-unique(targets$source)                                       # name top level nodes

r <- makeRiver( nodes, edges)                                              # make the River 
plot( r )                                                                  # plot it!
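
The lag step above pairs each row with the next row of the same customer, so it relies on the rows already being in year order. As a minimal sketch of the same counting step that sorts first, the variant below uses `lead()` and `count()`; it assumes a reasonably recent dplyr (one where `count()` accepts a `name` argument), and the object name `transitions` is only illustrative.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  Customer_Id = c(1, 1, 1, 2, 2, 2),
  Year_Store  = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
  stringsAsFactors = FALSE
)

transitions <- df %>%
  arrange(Customer_Id, Year_Store) %>%      # four-digit year prefix sorts chronologically
  group_by(Customer_Id) %>%
  mutate(target = lead(Year_Store)) %>%     # next Year_Store of the same customer, NA for the last row
  ungroup() %>%
  filter(!is.na(target),
         as.integer(substr(target, 1, 4)) -
           as.integer(substr(Year_Store, 1, 4)) == 1) %>%   # consecutive years only
  count(source = Year_Store, target, name = "len")          # customers per source/target pair

print(transitions)
```

The resulting columns `source`, `target` and `len` have the same names as in `targets`, so the node and edge construction above should work unchanged if `transitions` is used in its place.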


Answer 1 (score: 1)

Note that column names cannot contain * (see ?make.names). Here is a basic approach:

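  1. Split Year_Store into two separate columns, Year and Store, in your data frame; at the moment it holds two entirely different kinds of data, and you actually need to process them separately.

  2. Make a NextYear column, defined as Year + 1.

  3. Make a NextStore column holding the store code of the row that has the same Customer_Id and whose Year equals this row's NextYear. Assign NA if there is no record of the customer visiting a store in the following year, and throw an error if the data do not meet the required specification (that is, if it is ambiguous which store was visited first in the following year).

  4. Remove any rows where NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.

  5. Summarize your data frame by the Year_Store and NextYear_NextStore columns, for example using ddply from the plyr package.

Some example data: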

    # same example data as question
    customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
        Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
        stringsAsFactors = FALSE)
    
    # alternative data should throw error, customer 2 is inconsistent in 2011
    badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
        Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
        stringsAsFactors = FALSE)
    

Implementation:

    require(plyr)
    
    splitYearStore <-  function(df) {
        df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
        df$Store <- as.character(substring(df$Year_Store, 6))
        return(df) 
    }
    
    findNextStore <- function(df, matchCust, matchYear) {
        matchingStore <- with(df,
            df[Customer_Id == matchCust & Year == matchYear, "Store"])
        if (length(matchingStore) == 0) {
            return(NA)
        } else if (length(matchingStore) > 1) {
            errorString <- paste("Inconsistent store results for customer",
                matchCust, "in year", matchYear)
            stop(errorString)
        } else {
            return(matchingStore)
        }
    }
    
    tabulateTransitions <-  function(df) {
        df <- splitYearStore(df)
        df$NextYear <- df$Year + 1
        df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
            matchYear = df$NextYear, MoreArgs = list(df = df)) 
        df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
        df <- df[!is.na(df$NextStore),]
        df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
            summarise, No_Customers = length(Customer_Id))
        return(df) 
    }
    

Result:

    > tabulateTransitions(customer.df)
      Source Target No_Customers
    1 2010_A 2011_B            2
    2 2011_B 2012_C            1
    3 2011_B 2012_D            1
    > tabulateTransitions(badCustomer.df)
    Error in function (df, matchCust, matchYear) :
      Inconsistent store results for customer 2 in year 2011

No attempt has been made at optimization; if your data set is large, you should probably investigate a more efficient solution.
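
As one possible direction for larger data, the row-by-row lookup can be replaced by a single self-join. The sketch below uses only base R (`merge()` and `aggregate()`); the function name `tabulateTransitionsMerge` is hypothetical and not part of the answer above, and it assumes the same `Customer_Id` and `Year_Store` columns as the example data.

```r
# Same example data as in the question
customer.df <- data.frame(
  Customer_Id = c(1, 1, 1, 2, 2, 2),
  Year_Store  = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorized variant of the steps listed above
tabulateTransitionsMerge <- function(df) {
  df$Year <- as.integer(substr(df$Year_Store, 1, 4))

  # Refuse ambiguous input: more than one store for a customer in one year
  if (anyDuplicated(df[, c("Customer_Id", "Year")]) > 0) {
    stop("Inconsistent store results: a customer has several stores in one year")
  }

  # Copy of the data shifted back one year, so each row can find its successor
  nxt <- data.frame(Customer_Id = df$Customer_Id,
                    Year        = df$Year - 1L,
                    Target      = df$Year_Store,
                    stringsAsFactors = FALSE)

  # Inner join keeps only customers with a record in the following year
  pairs <- merge(df, nxt, by = c("Customer_Id", "Year"))
  if (nrow(pairs) == 0) {
    return(data.frame(Source = character(0), Target = character(0),
                      No_Customers = integer(0)))
  }

  # Count customers per (Source, Target) pair
  pairs$No_Customers <- 1L
  out <- aggregate(No_Customers ~ Year_Store + Target, data = pairs, FUN = sum)
  names(out)[1] <- "Source"
  out[order(out$Source, out$Target), ]
}

transitions <- tabulateTransitionsMerge(customer.df)
print(transitions)
```

Because the join matches on `Customer_Id` and on `Year` shifted by one, links between non-consecutive years are never produced, and the duplicate check stands in for the per-row error in `findNextStore`. Whether this is fast enough for a given data set would need to be measured.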