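Subtracting dates across DataFrames

Asked: 2019-03-15 15:25:48

Tags: python r

I have just started learning Python and R, so suggestions using either one would be greatly appreciated.

My data is stored in two data frames. One holds sales data: for each customer, it shows the date on which they purchased an item. The same person may have made several purchases: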

Date             Person ID      Product       
01-05-2012       1              cereal
01-05-2012       2              apple
02-08-2012       3              beef
03-22-2013       72             pot
07-19-2012       1              cake

The second data frame holds membership data, which tells us when a person joined the program:

Date             Person ID      Type      Status      
06-11-2008       1              Gold      New
10-12-2011       2              Gold      New    
02-08-2011       3              Silver    Renewal
02-01-2012       72             Gold      Renewal
03-22-2012       1              Gold      Renewal

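What I want to find out is, for the same person, how long it took them to buy something after joining the program.

For example, person 1 got a new membership on June 11, 2008 (06-11-2008) and bought cereal on January 5, 2012 (01-05-2012). I want to calculate how many days lie between those two dates.

However, this information is stored in separate data frames. I don't think they can simply be appended or merged into one data frame, because a person can have multiple observations in one or both of them.

What I'm considering is extracting all the dates from the sales data and all the dates from the license (membership) data, then merging the two into a new data frame. That would give me: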

License Date     Person ID      Sales Date            
06-11-2008       1              01-05-2012      
10-12-2011       2              01-05-2012         
02-08-2011       3              02-08-2012
02-01-2012       72             03-22-2013
06-11-2008       1              07-19-2012 
03-22-2012       1              01-05-2012
03-22-2012       1              07-19-2012    

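The problem is that if a person has two license dates (e.g. one new license and one renewal), the merge produces 2 × (number of sales dates) rows for that person, but I only want the license that was valid at the time of each purchase.

For example, person 1 bought cereal on 01-05-2012 under the license dated 06-11-2008, and bought cake on 07-19-2012 under the license dated 03-22-2012. The merged data frame, however, gives me 4 records instead of the 2 I want.

The result I want is, for each purchase, the time elapsed since the person obtained the license used for that purchase: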

License Date     Person ID      Sales Date   TimeToPurchase         
06-11-2008       1              01-05-2012      ...
10-12-2011       2              01-05-2012      ...
02-08-2011       3              02-08-2012      ...
02-01-2012       72             03-22-2013      ...
03-22-2012       1              07-19-2012      ...

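Do you have suggestions for a better way to do this?

Thank you very much for your help!

3 answers:

Answer 0 (score: 2)

pandas

First, you need to store the dates as datetimes, which you can do like this: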

import pandas as pd

sales['Date'] = pd.to_datetime(sales['Date'])
memberships['Date'] = pd.to_datetime(memberships['Date'])
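The answer does not show how `sales` and `memberships` are built. As a minimal sketch, assuming the two tables from the question are typed in by hand, they could be constructed as below. The explicit `format` argument is an addition for illustration; it simply pins the month-day-year reading of strings such as "02-08-2012".

```python
import pandas as pd

# Sample data copied from the question; the variable names match the answer's code.
sales = pd.DataFrame({
    "Date": ["01-05-2012", "01-05-2012", "02-08-2012", "03-22-2013", "07-19-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Product": ["cereal", "apple", "beef", "pot", "cake"],
})
memberships = pd.DataFrame({
    "Date": ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012", "03-22-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Type": ["Gold", "Gold", "Silver", "Gold", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal", "Renewal"],
})

# Parse the month-day-year strings into datetime64 columns.
sales["Date"] = pd.to_datetime(sales["Date"], format="%m-%d-%Y")
memberships["Date"] = pd.to_datetime(memberships["Date"], format="%m-%d-%Y")

print(sales.dtypes)
```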

Then merge them on Person ID, which gives you the layout with duplicates.

m = sales.merge(memberships, on='Person ID',
                suffixes=('_sales', '_memberships'))
m

  Date_sales  Person ID Product Date_memberships    Type   Status
0 2012-01-05          1  cereal       2008-06-11    Gold      New
1 2012-01-05          1  cereal       2012-03-22    Gold  Renewal
2 2012-07-19          1    cake       2008-06-11    Gold      New
3 2012-07-19          1    cake       2012-03-22    Gold  Renewal
4 2012-01-05          2   apple       2011-10-12    Gold      New
5 2012-02-08          3    beef       2011-02-08  Silver  Renewal
6 2013-03-22         72     pot       2012-02-01    Gold  Renewal

Now you can calculate the number of days elapsed between the sale and membership dates like this:

m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m

  Date_sales  Person ID Product Date_memberships    Type   Status  TimeToPurchase
0 2012-01-05          1  cereal       2008-06-11    Gold      New            1303
1 2012-01-05          1  cereal       2012-03-22    Gold  Renewal             -77
2 2012-07-19          1    cake       2008-06-11    Gold      New            1499
3 2012-07-19          1    cake       2012-03-22    Gold  Renewal             119
4 2012-01-05          2   apple       2011-10-12    Gold      New              85
5 2012-02-08          3    beef       2011-02-08  Silver  Renewal             365
6 2013-03-22         72     pot       2012-02-01    Gold  Renewal             415
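As a quick check of the arithmetic, the first two rows for person 1 can be reproduced with plain timestamps. This is only an illustrative sketch; it shows that subtracting two timestamps gives a `Timedelta` whose `.days` attribute is the whole-day count, and that a membership dated after the sale gives a negative number.

```python
import pandas as pd

# Sale on 2012-01-05 against the membership that started on 2008-06-11.
delta = pd.Timestamp("2012-01-05") - pd.Timestamp("2008-06-11")
print(delta.days)  # 1303

# Same sale against the renewal dated 2012-03-22, which came after the purchase.
later = pd.Timestamp("2012-01-05") - pd.Timestamp("2012-03-22")
print(later.days)  # -77
```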

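From here, first drop the negative values (a negative value means the membership started after the purchase), then take the minimum TimeToPurchase for each Person ID and Date_sales, which corresponds to the most recent membership before that sale: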

m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep

 Person ID Date_sales  TimeToPurchase
         1 2012-01-05            1303
         1 2012-07-19             119
         2 2012-01-05              85
         3 2012-02-08             365
        72 2013-03-22             415

These are the records you want to keep from the merged table; you can filter for them with an inner join:

result = m.merge(keep, on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result

Date_sales  Person ID Product Date_memberships    Type   Status  TimeToPurchase
2012-01-05          1  cereal       2008-06-11    Gold      New            1303
2012-07-19          1    cake       2012-03-22    Gold  Renewal             119
2012-01-05          2   apple       2011-10-12    Gold      New              85
2012-02-08          3    beef       2011-02-08  Silver  Renewal             365
2013-03-22         72     pot       2012-02-01    Gold  Renewal             415
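A different technique from the merge-then-filter steps above is `pd.merge_asof`, which matches each sale directly to the most recent membership on or before it, per person. The following is a sketch under stated assumptions, not part of the original answer: the sample data is retyped from the question, and the helper column `Date_memberships` is introduced here only to keep the membership date visible after the join.

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": ["01-05-2012", "01-05-2012", "02-08-2012", "03-22-2013", "07-19-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Product": ["cereal", "apple", "beef", "pot", "cake"],
})
memberships = pd.DataFrame({
    "Date": ["06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012", "03-22-2012"],
    "Person ID": [1, 2, 3, 72, 1],
    "Type": ["Gold", "Gold", "Silver", "Gold", "Gold"],
    "Status": ["New", "New", "Renewal", "Renewal", "Renewal"],
})
sales["Date"] = pd.to_datetime(sales["Date"], format="%m-%d-%Y")
memberships["Date"] = pd.to_datetime(memberships["Date"], format="%m-%d-%Y")

# merge_asof requires both frames to be sorted by the "on" key.
left = sales.sort_values("Date")
# Copy the membership date into its own column so it survives the join.
right = memberships.assign(Date_memberships=memberships["Date"]).sort_values("Date")

# direction="backward": for each sale, take the last membership of the same
# person whose date is on or before the sale date.
result = pd.merge_asof(left, right, on="Date", by="Person ID", direction="backward")
result = result.rename(columns={"Date": "Date_sales"})
result["TimeToPurchase"] = (result["Date_sales"] - result["Date_memberships"]).dt.days

print(result[["Person ID", "Product", "Date_sales", "Date_memberships", "TimeToPurchase"]])
```

One difference worth noting: `merge_asof` behaves like a left join, so a sale with no earlier membership would be kept with missing membership columns, whereas the filter on `TimeToPurchase >= 0` above drops such sales.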

data.table

Same logic as above, so I'll just paste the code.

# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]

# Merge
m <- memberships[sales, on = "Person ID"]

# Calculate elapsed days
m[, TimeToPurchase := as.numeric(i.Date - Date)]

# Eliminate negatives
m <- m[TimeToPurchase >= 0]

# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]

# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result

         Date Person ID   Type  Status     i.Date Product TimeToPurchase
1: 2008-06-11         1   Gold     New 2012-01-05  cereal           1303
2: 2011-10-12         2   Gold     New 2012-01-05   apple             85
3: 2011-02-08         3 Silver Renewal 2012-02-08    beef            365
4: 2012-02-01        72   Gold Renewal 2013-03-22     pot            415
5: 2012-03-22         1   Gold Renewal 2012-07-19    cake            119

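Answer 1 (score: 1)

Here is a solution using R and library(data.table). It assumes you are only interested in the earliest purchase after each license date (the minimum TimeToPurchase).

Edit: updated after the question was revised.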

library(data.table)

purchaseDT <- data.table(stringsAsFactors=FALSE,
                         Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
                         PersonID = c(1, 2, 1, 72),
                         Product = c("cereal", "apple", "beef", "pot")
)

programDT <- data.table(stringsAsFactors=FALSE,
                        Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
                        PersonID = c(1, 2, 1, 72),
                        Type = c("Gold", "Gold", "Silver", "Gold"),
                        Status = c("New", "New", "Renewal", "Renewal")
)

purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]

mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]

resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]

print(resultDT)

Result:

   LicenseDate PersonID TimeToPurchase PurchaseDate
1:  2008-06-11        1       208 days   2009-01-05
2:  2011-10-12        2        85 days   2012-01-05
3:  2011-02-08        1       365 days   2012-02-08
4:  2012-02-01       72       415 days   2013-03-22

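Answer 2 (score: 0)

Here is one idea for you. First, I merged the two data sets using Person_ID and Date. In this example, I needed to create a date object (i.e., Date) in the first mutate(). I sorted the data by Person_ID and Date. Then I created a new grouping variable: I identified the rows where Status is either "New" or "Renewal", that is, the points at which a license first became valid. Each such row becomes the first row of its license's group. For each group, I chose the first two rows. Because the data is arranged by Person_ID and Date, the second row should be the customer's first purchase under that valid license. Finally, I calculated the interval (i.e., time2purchase) using Date.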

library(dplyr)

full_join(df1, df2, by = c("Person_ID", "Date")) %>%
  mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
  arrange(Person_ID, Date) %>%
  mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
  group_by(group) %>%
  slice(1:2) %>%
  summarize(time2purchase = Date[2] - Date[1])

  group time2purchase
  <int> <time>       
1     1 1303 days    
2     2  119 days    
3     3   85 days    
4     4  365 days    
5     5  415 days   

To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().

  Date       Person_ID Product Type   Status  group time2purchase
  <date>         <int> <chr>   <chr>  <chr>   <int> <time>       
1 2008-06-11         1 NA      Gold   New         1 1303 days    
2 2012-03-22         1 NA      Gold   Renewal     2  119 days    
3 2011-10-12         2 NA      Gold   New         3   85 days    
4 2011-02-08         3 NA      Silver Renewal     4  365 days    
5 2012-02-01        72 NA      Gold   Renewal     5  415 days

Data

df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012", 
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame", 
row.names = c(NA, 
-5L))

df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011", 
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New", 
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA, 
-5L))