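Creating new variables in grouped data

Time: 2013-09-01 21:57:21

Tags: r loops plyr

My data looks like this (essentially sales of different brands by customer trip; a blank means the brand was not purchased by the customer on that trip, and store is where the purchase was made):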

customerid  date    store   brand1  brand2  brand3  brand4
1   01-03-2012  a    $3.00   $-      $-      $2.00 
1   06-03-2012  a    $2.00   $-      $-      $3.00 
1   11-03-2012  b    $2.00   $1.00   $1.00   $1.00 
1   26-03-2012  a    $2.00   $-      $-      $-   
2   16-03-2012  d    $2.00   $1.00   $1.00   $2.00 
2   21-03-2012  a    $-      $-      $1.00   $2.00 
2   26-03-2012  a    $2.00   $1.00   $3.00   $1.00 

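I want to create a separate data frame for each brand, containing only the rows where that brand's sales are > 0. So I thought I could put brand1 to brand4 in a list called colnames_df, like this: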

 colnames_df<- colnames(myDf)

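Once I have that, I can loop over its contents to generate the brand-level datasets. From the data above, I need 4 separate datasets, each with the relevant brand column plus only the other columns such as custID and date. The datasets below are what I want:

Dataset for brand1 (expected output):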

customerid  date    store   brand1
1   01-03-2012  a    $3.00 
1   06-03-2012  a    $2.00 
1   11-03-2012  b    $2.00 
1   26-03-2012  a    $2.00 
2   16-03-2012  d    $2.00 
2   26-03-2012  a    $2.00 

Dataset for brand2 (expected output):

   customerid   store   date    brand2
1   b   11-03-2012   $1.00 
2   d   16-03-2012   $1.00 
2   a   26-03-2012   $1.00 

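There would similarly be data frames for brand3 and brand4. For this part, should I write something like for (i in length(colnames_df)) { paste("Brand", i) <- } ? I am not sure how to write this. I need to create the brand-level data frames from the raw data above. If I use lapply with a function, I can work out how to get a list/data frame with all columns in the resulting data. How do I do what I need above?

Apart from this, I have one more requirement:

Once the brand-level datasets are created, I also need to create lag and counter variables on each of them, as follows:

  1. Step 1: create a counter variable for each customer trip (after the dataset has been sorted by custID and date).
  2. Expected output for brand1 (with counter):

    The code I use (I am having trouble putting this in a loop so that every brand-level dataset automatically gets the new variables; instead of brand1 below, it should automatically apply to brand1, 2, 3, 4, and so on):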

    brand1$counter <- with(brand1, ave(customerID, customerID, FUN = seq_along))
    
    customerid  date    store   brand1  counter_custtrip
    1   01-03-2012  a    $3.00  1
    1   06-03-2012  a    $2.00  2
    1   11-03-2012  b    $2.00  3
    1   26-03-2012  a    $2.00  4
    2   16-03-2012  d    $2.00  1
    2   26-03-2012  a    $2.00  2
    

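    2. Step 2: create a lagged variable, as in the expected output below.

    I can use code like this (my problem is that I can do these operations on each dataset separately, but how do I make all of this happen as each brand-level dataset is created?):

    ddply(.data = df, .variables = .(customerID), mutate,
       lagdate = c(NA, head(date, -1)))
    

    Expected output (for the brand1 dataset):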

      customerid    date    store   brand1  counter_custtrip    laggedtripdate
    1   01-03-2012  a    $3.00  1   -
    1   06-03-2012  a    $2.00  2   01-03-2012
    1   11-03-2012  b    $2.00  3   06-03-2012
    1   26-03-2012  a    $2.00  4   11-03-2012
    2   16-03-2012  d    $2.00  1   -
    2   26-03-2012  a    $2.00  2   16-03-2012
    
    1. Step 3: create the number of days between trips, by store
    2. See the expected output for brand1 (the same applies to all brands)

      customerid  date    store   brand1  counter_custtrip    laggedtripdate  daysbetweentrips
      1   01-03-2012  a    $3.00  1   -   -
      1   06-03-2012  a    $2.00  2   01-03-2012  5
      1   11-03-2012  b    $2.00  3       -
      1   26-03-2012  a    $2.00  4   06-03-2012  20
      2   16-03-2012  d    $2.00  1   -   -
      2   26-03-2012  a    $2.00  2   16-03-2012  -
      

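      As we can see, CustomerID 1 visited store a on 3/1, then 5 days later on 3/6, then 20 days later on 3/26. That is the logic. How do I do this for every customer at every store?

      I know this is a lot and I am almost there. I just need a few lines of advice on how to put the whole structure together, so that I can create the brand-level datasets in a loop and have all the new variables created as each data frame is built.

      Let me know if I have missed anything.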

2 Answers:

Answer 0: (score: 1)

Try the following answer, which converts to long format and uses data.table:

library(data.table)

# Your data:
data <- structure(list(customerid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(1325566800, 
1338696000, 1351915200, 1332734400, 1331870400, 1332302400, 1332734400
), class = c("POSIXct", "POSIXt"), tzone = ""), store = c("a", 
"a", "b", "a", "d", "a", "a"), brand1 = c(3L, 2L, 2L, 2L, 2L, 
NA, 2L), brand2 = c(NA, NA, 1L, NA, 1L, NA, 1L), brand3 = c(NA, 
NA, 1L, NA, 1L, 1L, 3L), brand4 = c(2L, 3L, 1L, NA, 2L, 2L, 1L
)), .Names = c("customerid", "date", "store", "brand1", "brand2", 
"brand3", "brand4"), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

# Convert from wide format to long, and subset to records with sales > 0:
data.long<-data.table(data[,list(customerid,store,date,laggedtripdate=as.POSIXct(NA))], brand=names(data)[4:7], sales=c(t(as.matrix(data[,4:7,with=F]))),key=c("customerid","date"))[sales>0]

# Add the lagged date, by customerid:
data.long[data.long[,.N,by=list(customerid,date)][,laggedtripdate:=c(as.POSIXct(NA),date),by=customerid],laggedtripdate:=i.laggedtripdate]

# Add daysbetweentrips:
data.long[,daysbetweentrips:=date-laggedtripdate]

# Add counter_custtrip:
data.long[,counter_custtrip:=1:.N,by=list(customerid,brand)]

# Subset of results for brand==1:
data.long[brand=="brand1"]
#   customerid store       date laggedtripdate  brand sales daysbetweentrips counter_custtrip
#1:          1     a 2012-01-03           <NA> brand1     3          NA days                1
#2:          1     a 2012-03-26     2012-01-03 brand1     2    82.95833 days                2
#3:          1     a 2012-06-03     2012-03-26 brand1     2    69.00000 days                3
#4:          1     b 2012-11-03     2012-06-03 brand1     2   153.00000 days                4
#5:          2     d 2012-03-16           <NA> brand1     2          NA days                1
#6:          2     a 2012-03-21     2012-03-16 brand1     2     5.00000 days                2
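
Since the question asks for one data frame per brand, a long table like the one above can be split afterwards. The following is only a minimal base-R sketch on a small made-up data frame (the object names long and by_brand and the values are assumptions for illustration, not the asker's real data); it shows split() returning a named list with one data frame per brand.

```r
# Toy long-format data (hypothetical values, for illustration only)
long <- data.frame(
  customerid = c(1L, 1L, 1L, 2L, 2L),
  date  = as.Date(c("2012-03-01", "2012-03-01", "2012-03-06",
                    "2012-03-16", "2012-03-16")),
  brand = c("brand1", "brand4", "brand1", "brand1", "brand2"),
  sales = c(3, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# One data frame per brand, stored in a named list
by_brand <- split(long, long$brand)

names(by_brand)        # the brands that actually occur in the data
by_brand[["brand1"]]   # only the rows for brand1
```

Keeping the pieces in a list, rather than creating variables named brand1, brand2 and so on, makes it easy to apply the same function to every brand with lapply().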

Answer 1: (score: 0)

Here is an example with the data in long data frame format.

library(reshape2)
library(plyr)


# Prepare data
# melt data
# measured variables given as a vector of variable names
df2 <- melt(data = df,
            measure.vars = paste0("brand", 1:4),
            variable.name = "brand",
            value.name = "sale")

Updated melt after a comment from @kaos1511:

# handling brand names that are not of the form brand1, brand2, ..., brandn

# add some fake brand names to df
names(df) <- c("customerid", "date", "store", "Mazda", "Toyota", "Nissan", "Volvo")

# If data for different brands always come after customerid, date and store
# you can melt data by specifying 'measure variables' by position, like this
# melt data
df2 <- melt(data = df,
            measure.vars = 4:(ncol(df)),
            variable.name = "brand",
            value.name = "sale")

# alternatively, you can specify customerid, date and store as 'id variables'
# melt will then assume that all remaining variables, i.e. all 'brand columns', are measure variables
df2 <- melt(data = df,
            id.vars = c("customerid", "date", "store"),
            variable.name = "brand",
            value.name = "sale")

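# remove $, replace - with 0, and convert to numeric
# (without the conversion, sale stays character and 'sale > 0' compares strings)
df2$sale <- with(df2, gsub(pattern = "$", replacement = "", sale, fixed = TRUE))
df2$sale <- trimws(df2$sale)
df2$sale[df2$sale == "-"] <- 0
df2$sale <- as.numeric(df2$sale)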

# convert to date 
df2$date <- as.Date(df2$date, format = "%d-%m-%Y")

# select rows with sale > 0
df3 <- df2[df2$sale > 0, ]


# Create new variables
# per brand and customerid, create counter and lagdate
# nb, in your last two 'expected output', lagdate does not match.
# my lagdate matches the first of them.
df4 <- ddply(.data = df3, .variables = .(brand, customerid), mutate,
             counter = as.numeric(as.factor(date)),
             lagdate = c(NA, as.character(head(date, -1))))

# order by brand, store and date
df4 <- arrange(df4, brand, store, date)

# per brand and store, calculate days between trips
df5 <- ddply(.data = df4, .variables = .(brand, store), mutate,
             daysbetweentrips = c(NA, diff(date)))

# order by brand, customerid and date
df5 <- arrange(df5, brand, customerid, date)
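
If separate brand-level data frames are still wanted, as in the question, the same steps can be wrapped in a function and applied to each brand column. The following is a minimal base-R sketch, not the method used in either answer: the helper name make_brand_df and the toy data frame wide are hypothetical, sales are assumed to be numeric already with NA for "no purchase", the lagged date is taken per customer, and days between trips is taken per customer and store, which is one reading of the expected output in the question.

```r
# Toy wide data modelled on the question (numeric sales, NA = no purchase)
wide <- data.frame(
  customerid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  date  = as.Date(c("01-03-2012", "06-03-2012", "11-03-2012", "26-03-2012",
                    "16-03-2012", "21-03-2012", "26-03-2012"),
                  format = "%d-%m-%Y"),
  store  = c("a", "a", "b", "a", "d", "a", "a"),
  brand1 = c(3, 2, 2, 2, 2, NA, 2),
  brand2 = c(NA, NA, 1, NA, 1, NA, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the brand-level data frame for one brand column
make_brand_df <- function(df, brand) {
  # keep only trips where this brand was bought
  keep <- !is.na(df[[brand]]) & df[[brand]] > 0
  out <- df[keep, c("customerid", "date", "store", brand)]
  out <- out[order(out$customerid, out$date), ]

  # trip counter within each customer
  out$counter_custtrip <- ave(seq_along(out$customerid), out$customerid,
                              FUN = seq_along)

  # date of the customer's previous trip (NA for the first trip)
  prev <- ave(as.numeric(out$date), out$customerid,
              FUN = function(x) c(NA, head(x, -1)))
  out$laggedtripdate <- as.Date(prev, origin = "1970-01-01")

  # days since the customer's previous trip to the same store
  key <- paste(out$customerid, out$store)
  out$daysbetweentrips <- ave(as.numeric(out$date), key,
                              FUN = function(x) c(NA, diff(x)))

  rownames(out) <- NULL
  out
}

# Apply the helper to every brand column; the result is a named list
brand_cols <- c("brand1", "brand2")
brand_dfs <- lapply(setNames(brand_cols, brand_cols), make_brand_df, df = wide)

brand_dfs$brand1
```

Each element of brand_dfs is one brand-level data frame, so adding another brand only means adding its column name to brand_cols.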