Question

I have a dataset like the one shown below (with about 500 columns):

custid  date    store   abc efg hij klm …   xyz
1   1-Feb-13    a   2   0   2   1       1
1   5-Feb-13    c   0   3   3   0       0
1   9-Feb-13    a   3   3   0   0       1
1   31-Mar-13   a   3   0   0   0       0

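As you can see, abc, efg, hij are the names of the products being sold. There is one such column per product, so there are about 500 product-sales columns in total. Each row holds the sales of the different products on one customer trip.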

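What I need most is to create 500 data frames (not a list), so that each data frame contains the column for one product plus the common columns such as custid, date and store. The data frame for product abc would therefore contain only the following columns: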

custid  date    store   abc
1   1-Feb-13    a   2
1   5-Feb-13    c   0
1   9-Feb-13    a   3
1   31-Mar-13   a   3

In addition, I need to keep only the rows where the value for that product is > 0, so the above would become:

custid  date    store   abc
1   1-Feb-13    a   2
1   9-Feb-13    a   3
1   31-Mar-13   a   3

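I thought I would put the product names abc, efg, etc. in a list and loop over it, creating new variables on each product dataset along the way. I also need a lag variable and a days-between-trips variable in each product-level dataset. I would like to do this in a for loop that generates the product-level datasets, something like the following (it is not valid R, but please help):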

colnames_df <- colnames(df[, c(4:500)])  # this will hold the product names in a dataframe/list called colnames_df

for (i in 1 to nrow(colnames_df)) { paste("category", i) <- df[, i > 0] }

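Then I want to loop over colnames_df so that, when the loop starts, the dataset for the first product (abc) is created as shown above, and so on. When abc is created, it should also contain the lag variable and the days between trips, which may vary by store. How do I do that? I would like to make extensive use of loops here (see below for the expected final output of each product-level data frame):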

custid  date    store   abc lagdate daysbetweentrips
1   1-Feb-13    a   2       -
1   5-Feb-13    a   3   1-Feb-13    4
1   31-Mar-13   a   3   5-Feb-13    26

I have been searching for answers to my question, but somehow could not find a direct solution. Any help is appreciated.

Thanks!

Answer 1

Something like this:

dat <- read.table(text="custid  date    store   abc efg hij klm   xyz
1   1-Feb-13    a   2   0   2   1       1
1   5-Feb-13    c   0   3   3   0       0
1   9-Feb-13    a   3   3   0   0       1
1   31-Mar-13   a   3   0   0   0       0", header=TRUE, stringsAsFactors=FALSE)

#Locale needs to be English to parse month names.
#"English" is a Windows locale name; on Linux/macOS "C" or "en_US.UTF-8" should work instead:
Sys.setlocale(category = "LC_TIME", locale = "English")
dat$date <- as.Date(dat$date, format="%d-%b-%y")

#reshape to long format
library(reshape2)
dat <- melt(dat, id.vars=c("custid", "date","store"))

#subset
dat <- dat[dat$value>0,]

#calculate days between trips per customer. 
library(plyr)
dat <- ddply(dat, .(custid, variable), transform, daysbetweentrips=c(NA,diff(date)))

#I doubt the following is useful:
dats <- by(dat, dat$variable, function(df) df)

dats

# dat$variable: abc
#   custid       date store variable value daysbetweentrips
# 1      1 2013-02-01     a      abc     2               NA
# 2      1 2013-02-09     a      abc     3                8
# 3      1 2013-03-31     a      abc     3               50
# ------------------------------------------------------------------------------------------------------- 
#   dat$variable: efg
#   custid       date store variable value daysbetweentrips
# 4      1 2013-02-05     c      efg     3               NA
# 5      1 2013-02-09     a      efg     3                4
# ------------------------------------------------------------------------------------------------------- 
#   dat$variable: hij
#   custid       date store variable value daysbetweentrips
# 6      1 2013-02-01     a      hij     2               NA
# 7      1 2013-02-05     c      hij     3                4
# ------------------------------------------------------------------------------------------------------- 
#   dat$variable: klm
#   custid       date store variable value daysbetweentrips
# 8      1 2013-02-01     a      klm     1               NA
# ------------------------------------------------------------------------------------------------------- 
#   dat$variable: xyz
#    custid       date store variable value daysbetweentrips
# 9       1 2013-02-01     a      xyz     1               NA
# 10      1 2013-02-09     a      xyz     1                8
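
The output above has no lagdate column, and the product sales sit in a generic value column. A minimal sketch of how both could be brought closer to the layout the question asks for, using base R only. The names long and per_product are made up for this illustration, and the rows are copied from the abc and efg groups printed above.

```r
# Rows copied from the abc and efg groups of the output above.
long <- data.frame(
  custid = 1,
  date = as.Date(c("2013-02-01", "2013-02-09", "2013-03-31",
                   "2013-02-05", "2013-02-09")),
  store = c("a", "a", "a", "c", "a"),
  variable = c("abc", "abc", "abc", "efg", "efg"),
  value = c(2, 3, 3, 3, 3),
  daysbetweentrips = c(NA, 8, 50, NA, 4),
  stringsAsFactors = FALSE
)

# The previous trip date is the current date minus the gap in days
# (NA for the first trip of each product).
long$lagdate <- long$date - long$daysbetweentrips

# One data frame per product, with `value` renamed to the product name.
per_product <- lapply(split(long, long$variable), function(d) {
  product <- d$variable[1]
  names(d)[names(d) == "value"] <- product
  d[, c("custid", "date", "store", product, "lagdate", "daysbetweentrips")]
})

per_product$abc
```

Deriving lagdate from daysbetweentrips avoids a second grouped pass over the data; it relies on the gaps having been computed on rows sorted by date within each group.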

Answer 2

Use this:

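# Assumes df$date is of class Date, e.g. as.Date(df$date, format = "%d-%b-%y").
for(i in names(df)[-(1:3)])
    # within() returns the modified data frame (with() would return only the last value)
    assign(i, within(df[order(df$store, df$date), c("custid","date","store",i)], {
        lagdate <- c(as.Date(NA), head(date,-1))   # as.Date(NA) keeps the Date class
        daysbetweentrips <- date - lagdate
    }))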

This creates data frames whose names are given by the column names of the data frame (starting from column 4).
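
The question also asks to drop the rows where the product was not sold. A minimal sketch of that extra step, assuming date is already of class Date. The toy df is the question's sample data cut down to two product columns, and sold is a name made up for this illustration. The lag is taken over the remaining rows in date order, which is only one possible reading of the requirement.

```r
# Toy data: the question's sample rows, with two product columns only.
df <- data.frame(
  custid = 1,
  date = as.Date(c("2013-02-01", "2013-02-05", "2013-02-09", "2013-03-31")),
  store = c("a", "c", "a", "a"),
  abc = c(2, 0, 3, 3),
  efg = c(0, 3, 3, 0),
  stringsAsFactors = FALSE
)

for (i in names(df)[-(1:3)]) {
  sold <- df[df[[i]] > 0, c("custid", "date", "store", i)]  # rows with sales only
  sold <- sold[order(sold$date), ]                           # chronological order
  sold$lagdate <- c(as.Date(NA), head(sold$date, -1))        # previous trip date
  sold$daysbetweentrips <- as.numeric(sold$date - sold$lagdate)
  assign(i, sold)                                            # creates `abc` and `efg`
}

abc$daysbetweentrips   # NA 8 50
```

With more than one customer the lag would have to be computed per custid, otherwise the first trip of one customer is compared with the last trip of another.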

If you want data frames named "Category_X", with X running from 1 to 497, you can use:

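# Assumes df$date is of class Date, e.g. as.Date(df$date, format = "%d-%b-%y").
categ <- names(df)[-(1:3)]
for(i in seq_along(categ))
    # within() returns the modified data frame (with() would return only the last value)
    assign(paste0("Category_",i), within(df[order(df$store, df$date), c(1:3,i+3)], {
        lagdate <- c(as.Date(NA), head(date,-1))   # as.Date(NA) keeps the Date class
        daysbetweentrips <- date - lagdate
    }))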

That said, a list would be better... much better :-)
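
A rough sketch of what the list version could look like, assuming date is of class Date. The names products and by_product are made up for this illustration, and the two rows for custid 2 are invented here only to show the per-customer lag; they are not part of the question's data.

```r
df <- data.frame(
  custid = c(1, 1, 1, 1, 2, 2),   # custid 2 rows are hypothetical
  date = as.Date(c("2013-02-01", "2013-02-05", "2013-02-09", "2013-03-31",
                   "2013-02-03", "2013-02-10")),
  store = c("a", "c", "a", "a", "b", "b"),
  abc = c(2, 0, 3, 3, 1, 4),
  efg = c(0, 3, 3, 0, 0, 2),
  stringsAsFactors = FALSE
)

products <- names(df)[-(1:3)]

# One list element per product, named after the product column.
by_product <- lapply(setNames(products, products), function(p) {
  sold <- df[df[[p]] > 0, c("custid", "date", "store", p)]
  sold <- sold[order(sold$custid, sold$date), ]
  # Lag within each customer, so that a customer's first trip is not
  # compared with another customer's last trip.
  prev <- ave(as.numeric(sold$date), sold$custid,
              FUN = function(d) c(NA, head(d, -1)))
  sold$lagdate <- as.Date(prev, origin = "1970-01-01")
  sold$daysbetweentrips <- as.numeric(sold$date - sold$lagdate)
  sold
})

by_product[["abc"]]
```

Each product is then reached as by_product$abc or by_product[["abc"]], and further per-product steps can be applied to all elements with lapply, without assign or get and without 500 separate objects in the workspace.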

Multiple data frames for big data