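I have just started learning Python and R, so advice using either of them would be greatly appreciated.
My data is stored in two data frames. One holds sales data: for each consumer, we can see the date on which he bought something. The same person can make purchases more than once: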
Date        Person ID  Product
01-05-2012  1          cereal
01-05-2012  2          apple
02-08-2012  3          beef
03-22-2013  72         pot
07-19-2012  1          cake
The second data frame holds membership data, which tells us when a person enrolled in the program:
Date        Person ID  Type    Status
06-11-2008  1          Gold    New
10-12-2011  2          Gold    New
02-08-2011  3          Silver  Renewal
02-01-2012  72         Gold    Renewal
03-22-2012  1          Gold    Renewal
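What I want to do, for each person, is find out how long it takes him to purchase something after enrolling in the program.
For example, person 1 got a new membership on June 11, 2008 and purchased cereal on January 5, 2012. I want to calculate how many days lie between those two dates.
However, this information is stored in separate data frames. I don't think they can simply be appended or merged into one data frame, because a person can have multiple observations in either or both of them.
What I am thinking of is to extract all the dates from the sales data and all the dates from the license data, then merge the two into a new data frame. That would give me: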
License Date  Person ID  Sales Date
06-11-2008    1          01-05-2012
10-12-2011    2          01-05-2012
02-08-2011    3          02-08-2012
02-01-2012    72         03-22-2013
06-11-2008    1          07-19-2012
03-22-2012    1          01-05-2012
03-22-2012    1          07-19-2012
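But the problem here is that if a person has two license dates (for example a new license and a renewal), merging the data gives 2 * (sales dates) rows... whereas I only want the one license that was in effect at the time of the purchase.
For example, person 1 bought cereal on 01-05-2012 under license 06-11-2008, and bought cake on 07-19-2012 under license 03-22-2012. But merging the data frames gives me 4 records instead of the 2 I want...
The result I want is, for each purchase, the time elapsed since he got the license that was used for that purchase: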
License Date  Person ID  Sales Date  TimeToPurchase
06-11-2008    1          01-05-2012  ...
10-12-2011    2          01-05-2012  ...
02-08-2011    3          02-08-2012  ...
02-01-2012    72         03-22-2013  ...
03-22-2012    1          07-19-2012  ...
Can you suggest a better way to do this?
Thank you very much for your help!
Answer 0 (score: 2)
First, you need to store the dates as datetimes, which you can do like this:
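import pandas as pd
# Parse the MM-DD-YYYY strings into datetime values
sales['Date'] = pd.to_datetime(sales['Date'], format='%m-%d-%Y')
memberships['Date'] = pd.to_datetime(memberships['Date'], format='%m-%d-%Y')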
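For anyone who wants to run the steps in this answer, here is a minimal sketch that rebuilds the two tables from the question and parses the dates. The names sales and memberships follow this answer; the explicit format="%m-%d-%Y" is an assumption based on the month-day-year strings shown in the question.

```python
import pandas as pd

# Sample data copied from the question (dates are month-day-year strings).
sales = pd.DataFrame({
    "Date": ["01-05-2012", "01-05-2012", "02-08-2012", "03-22-2013", "07-19-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Product": ["cereal", "apple", "beef", "pot", "cake"],
})
memberships = pd.DataFrame({
    "Date": ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012", "03-22-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Type": ["Gold", "Gold", "Silver", "Gold", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal", "Renewal"],
})

# Assumption: every date string uses the same MM-DD-YYYY layout.
sales["Date"] = pd.to_datetime(sales["Date"], format="%m-%d-%Y")
memberships["Date"] = pd.to_datetime(memberships["Date"], format="%m-%d-%Y")

print(sales)
print(memberships)
```

Stating the format explicitly means an ambiguous string such as 02-08-2012 is always read as February 8 rather than August 2.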
Then you merge them on Person ID, which gives you the format with the duplicates.
m = sales.merge(memberships, left_on='Person ID', right_on='Person ID',
suffixes=('_sales', '_memberships'))
m
Date_sales Person ID Product Date_memberships Type Status
0 2012-01-05 1 cereal 2008-06-11 Gold New
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal
2 2012-07-19 1 cake 2008-06-11 Gold New
3 2012-07-19 1 cake 2012-03-22 Gold Renewal
4 2012-01-05 2 apple 2011-10-12 Gold New
5 2012-02-08 3 beef 2011-02-08 Silver Renewal
6 2013-03-22 72 pot 2012-02-01 Gold Renewal
Now you can calculate the elapsed days between the sale and membership dates like this:
m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
0 2012-01-05 1 cereal 2008-06-11 Gold New 1303
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal -77
2 2012-07-19 1 cake 2008-06-11 Gold New 1499
3 2012-07-19 1 cake 2012-03-22 Gold Renewal 119
4 2012-01-05 2 apple 2011-10-12 Gold New 85
5 2012-02-08 3 beef 2011-02-08 Silver Renewal 365
6 2013-03-22 72 pot 2012-02-01 Gold Renewal 415
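To see where the negative values come from, the subtraction for person 1's cereal purchase can be checked by hand with the standard library. The helper name days_between is only illustrative and not part of pandas.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Return the signed number of days from start to end."""
    return (end - start).days


# Cereal purchase (2012-01-05) against the original membership (2008-06-11).
cereal_vs_new = days_between(date(2008, 6, 11), date(2012, 1, 5))

# Same purchase against the renewal (2012-03-22), which had not started yet.
cereal_vs_renewal = days_between(date(2012, 3, 22), date(2012, 1, 5))

print(cereal_vs_new, cereal_vs_renewal)  # 1303 -77
```

A negative result means the membership row starts after the sale, so it cannot be the one that was used for that purchase.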
From here you can first eliminate the negatives, then get the minimum TimeToPurchase for each Person ID and Date_sales.
m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep
Person ID Date_sales TimeToPurchase
1 2012-01-05 1303
1 2012-07-19 119
2 2012-01-05 85
3 2012-02-08 365
72 2013-03-22 415
These are the records you want to keep from the merged table, and you can filter for them with an inner join:
result = m.merge(keep,
left_on=['Person ID', 'Date_sales', 'TimeToPurchase'],
right_on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
2012-01-05 1 cereal 2008-06-11 Gold New 1303
2012-07-19 1 cake 2012-03-22 Gold Renewal 119
2012-01-05 2 apple 2011-10-12 Gold New 85
2012-02-08 3 beef 2011-02-08 Silver Renewal 365
2013-03-22 72 pot 2012-02-01 Gold Renewal 415
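As a possible shortcut, the "minimum, then inner join" pair of steps can be collapsed by asking each group for the index label of its smallest TimeToPurchase and selecting those rows directly. This is a sketch on the question's sample data; best_rows is a name I made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": ["01-05-2012", "01-05-2012", "02-08-2012", "03-22-2013", "07-19-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Product": ["cereal", "apple", "beef", "pot", "cake"],
})
memberships = pd.DataFrame({
    "Date": ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012", "03-22-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Type": ["Gold", "Gold", "Silver", "Gold", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal", "Renewal"],
})
for frame in (sales, memberships):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m-%d-%Y")

# Same merge and day count as in the answer.
m = sales.merge(memberships, on="Person ID", suffixes=("_sales", "_memberships"))
m["TimeToPurchase"] = (m["Date_sales"] - m["Date_memberships"]).dt.days
m = m[m["TimeToPurchase"] >= 0]

# idxmin gives, per group, the index label of the row with the smallest value.
best_rows = m.groupby(["Person ID", "Date_sales"])["TimeToPurchase"].idxmin()
result = m.loc[best_rows].reset_index(drop=True)
print(result)
```

One difference to be aware of: idxmin keeps exactly one row per group, so if a person bought several products on the same date only one of them would survive. Adding Product to the grouping keys would avoid that.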
In R (using data.table) the logic is the same as above, so I will just paste the code.
library(data.table)
# Date types (assumes sales and memberships are already data.tables)
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]
# Merge
m <- memberships[sales, on = "Person ID"]
# Calculate elapsed days
m[, TimeToPurchase := as.numeric(m$i.Date - m$Date)]
# Eliminate negatives
m <- m[TimeToPurchase >= 0]
# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]
# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result
Date Person ID Type Status i.Date Product TimeToPurchase
1: 2008-06-11 1 Gold New 2012-01-05 cereal 1303
2: 2011-10-12 2 Gold New 2012-01-05 apple 85
3: 2011-02-08 3 Silver Renewal 2012-02-08 beef 365
4: 2012-02-01 72 Gold Renewal 2013-03-22 pot 415
5: 2012-03-22 1 Gold Renewal 2012-07-19 cake 119
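Back on the pandas side, the merge, filter and minimum sequence can probably also be written with pd.merge_asof, which for each sale picks the latest membership row dated on or before the sale within the same Person ID. This is an alternative sketch, not the code used above; the renamed columns Date_sales and Date_memberships are chosen to match the earlier output.

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": ["01-05-2012", "01-05-2012", "02-08-2012", "03-22-2013", "07-19-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Product": ["cereal", "apple", "beef", "pot", "cake"],
})
memberships = pd.DataFrame({
    "Date": ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012", "03-22-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Type": ["Gold", "Gold", "Silver", "Gold", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal", "Renewal"],
})
for frame in (sales, memberships):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m-%d-%Y")

# merge_asof requires both inputs to be sorted by their date key.
left = sales.rename(columns={"Date": "Date_sales"}).sort_values("Date_sales")
right = (memberships.rename(columns={"Date": "Date_memberships"})
         .sort_values("Date_memberships"))

# direction="backward": the last membership date <= the sale date, per person.
asof = pd.merge_asof(
    left, right,
    left_on="Date_sales", right_on="Date_memberships",
    by="Person ID", direction="backward",
)
asof["TimeToPurchase"] = (asof["Date_sales"] - asof["Date_memberships"]).dt.days
print(asof)
```

Unlike the inner join, a sale with no earlier membership is kept here with missing membership columns rather than dropped, so such rows may need to be filtered out afterwards.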
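Answer 1 (score: 1)
Here is a solution using R and library(data.table), assuming you are only interested in the closest purchase after each license date:
Edit: after the question was updated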
library(data.table)
purchaseDT <- data.table(stringsAsFactors=FALSE,
Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
PersonID = c(1, 2, 1, 72),
Product = c("cereal", "apple", "beef", "pot")
)
programDT <- data.table(stringsAsFactors=FALSE,
Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
PersonID = c(1, 2, 1, 72),
Type = c("Gold", "Gold", "Silver", "Gold"),
Status = c("New", "New", "Renewal", "Renewal")
)
purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]
mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]
resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]
print(resultDT)
Result:
LicenseDate PersonID TimeToPurchase PurchaseDate
1: 2008-06-11 1 208 days 2009-01-05
2: 2011-10-12 2 85 days 2012-01-05
3: 2011-02-08 1 365 days 2012-02-08
4: 2012-02-01 72 415 days 2013-03-22
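For readers following along in pandas, a rough equivalent of the data.table steps above might look like the sketch below. It reuses this answer's own sample data and groups by license, so each license keeps only its first later purchase; the variable names purchases and programs are mine.

```python
import pandas as pd

fmt = "%m-%d-%Y"
purchases = pd.DataFrame({
    "PurchaseDate": pd.to_datetime(
        ["01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"], format=fmt),
    "PersonID": [1, 2, 1, 72],
    "Product": ["cereal", "apple", "beef", "pot"],
})
programs = pd.DataFrame({
    "LicenseDate": pd.to_datetime(
        ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"], format=fmt),
    "PersonID": [1, 2, 1, 72],
    "Type": ["Gold", "Gold", "Silver", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal"],
})

# Pair every license with every purchase by the same person.
merged = programs.merge(purchases, on="PersonID")
merged["TimeToPurchase"] = merged["PurchaseDate"] - merged["LicenseDate"]

# Keep strictly positive gaps, as the R code does with TimeToPurchase > 0.
merged = merged[merged["TimeToPurchase"] > pd.Timedelta(0)]

# Smallest gap per license, then rebuild the matching purchase date.
result = (merged.groupby(["LicenseDate", "PersonID"], as_index=False)
          ["TimeToPurchase"].min())
result["PurchaseDate"] = result["LicenseDate"] + result["TimeToPurchase"]
print(result)
```

Note that the strict > 0 comparison drops a purchase made on the very day the license starts, whereas the first answer keeps it with >= 0.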
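Answer 2 (score: 0)
Here is one idea for you. First, I merged the two data sets using Person_ID and Date. In this example, I needed to create a date object (i.e. Date) in the first mutate(). I sorted the data by Person_ID and Date. Then I created a new grouping variable: I identified the rows where Status is either "New" or "Renewal", which are the points at which a license first became valid. Each such row becomes the first row of its license group. For each group, I took the first two rows. Because the data is arranged by Person_ID and Date, the second row should be the first thing the customer bought under that valid license. Finally, I calculated the interval (i.e. time2purchase) using Date.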
library(dplyr)
full_join(df1, df2, by = c("Person_ID", "Date")) %>%
mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
arrange(Person_ID, Date) %>%
mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
group_by(group) %>%
slice(1:2) %>%
summarize(time2purchase = Date[2]-Date[1])
group time2purchase
<int> <time>
1 1 1303 days
2 2 119 days
3 3 85 days
4 4 365 days
5 5 415 days
To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().
Date Person_ID Product Type Status group time2purchase
<date> <int> <chr> <chr> <chr> <int> <time>
1 2008-06-11 1 NA Gold New 1 1303 days
2 2012-03-22 1 NA Gold Renewal 2 119 days
3 2011-10-12 2 NA Gold New 3 85 days
4 2011-02-08 3 NA Silver Renewal 4 365 days
5 2012-02-01 72 NA Gold Renewal 5 415 days
Data
df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012",
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame",
row.names = c(NA,
-5L))
df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011",
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New",
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA,
-5L))
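The interval idea in this answer (find which license period each row falls into) can also be sketched in plain Python with the bisect module, in the same spirit as findInterval. This is only an illustration on the data above, and the function name active_license is hypothetical. Unlike the dplyr pipeline, it reports every purchase, not just the first one per license.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# (license date, person) and (purchase date, person, product) from the data above.
licenses = [
    (date(2008, 6, 11), 1), (date(2011, 10, 12), 2), (date(2011, 2, 8), 3),
    (date(2012, 2, 1), 72), (date(2012, 3, 22), 1),
]
purchases = [
    (date(2012, 1, 5), 1, "cereal"), (date(2012, 1, 5), 2, "apple"),
    (date(2012, 2, 8), 3, "beef"), (date(2013, 3, 22), 72, "pot"),
    (date(2012, 7, 19), 1, "cake"),
]

# Sorted license dates for each person.
by_person = defaultdict(list)
for license_date, person in licenses:
    by_person[person].append(license_date)
for dates in by_person.values():
    dates.sort()


def active_license(person, purchase_date):
    """Return the latest license date on or before purchase_date, or None."""
    dates = by_person.get(person, [])
    pos = bisect_right(dates, purchase_date)
    return dates[pos - 1] if pos else None


gaps = {}
for purchase_date, person, product in purchases:
    license_date = active_license(person, purchase_date)
    if license_date is not None:
        gaps[(person, product)] = (purchase_date - license_date).days

print(gaps)
```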