比较同一数据框中的组中的行

时间:2018-04-25 12:41:47

标签: r dplyr data.table lag

我的数据如下:

library(dplyr)
library(data.table)

df <- data.frame(
  customernumber = c("111", "111", "111",  "111", "111","222", "222", "222", "222", "222", "222", "222"), 
  ordernumber = c("1", "1", "1", "2", "2", "1", "1", "1", "1", "2", "2", "3"), 
  article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansB", "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"), 
  size = c("40", "42", "40", "42", "44", "36", "36", "40", "40", "38", "44", "36"), 
  returned = c("1", "1", "0", "0", "1", "1", "1", "0", "0", "0", "0", "0")
)

输出:

   customernumber ordernumber article size returned
1             111           1  JeansA   40        1
2             111           1  JeansA   42        1
3             111           1  ShirtA   40        0
4             111           2  JeansA   42        0
5             111           2  JeansB   44        1
6             222           1  ShirtA   36        1
7             222           1  ShirtB   36        1
8             222           1  ShirtB   40        0
9             222           1  JeansA   40        0
10            222           2  JeansB   38        0
11            222           2  ShirtA   44        0
12            222           3  JeansB   36        0

现在我想标记每个客户的所有订单,其中已经退回了一篇文章,但是在下一个订单中以不同的尺寸再次订购。因此,所有只交换的文章因此不能真正被视为新订单。所以最终的结果应该是这样的:

结果:

   customernumber ordernumber article size returned changed
1             111           1  JeansA   40        1       0
2             111           1  JeansA   42        1       0
3             111           1  ShirtA   40        0       0
4             111           2  JeansA   42        0       1
5             111           2  JeansB   44        1       0
6             222           1  ShirtA   36        1       0
7             222           1  ShirtB   36        1       0
8             222           1  ShirtB   40        0       0
9             222           1  JeansA   40        0       0
10            222           2  JeansB   38        0       0
11            222           2  ShirtA   44        0       1
12            222           3  JeansB   36        0       0

我认为我可以通过使用dyplr(或data.table)引入滞后变量来解决问题,但我只能设法在同一组内滞后变量,但我无法将其延迟到下一组。这是:

df %>% 
  group_by(customernumber, ordernumber, article) %>% 
  mutate(lag_size = lag(size, order_by = article))

或:

df <- data.table(df)
setorder(df, customernumber, ordernumber, article)
df[,lag_size := shift(size), by = .(customernumber, ordernumber, article)]

我不想考虑for循环(甚至不确定它是否会解决问题),因为数据集非常大并且需要年龄。而且我总体上缺乏想法。所以任何帮助都表示赞赏。

谢谢!

附加元件:

我偶然发现了与此案有关的另一个问题。我只想在下一个跟进订单中标记已在另一个尺寸中订购的文章,如果已更改,则不会在同一尺寸的同一文章再次订购时。所以变量的标准是:

订单n:返回== 1 订单n + 1:相同的文章,不同的尺寸 - &gt;更改== 1(否则更改== 0)

以下是更新的示例:

df <- data.frame(
 customernumber = c("111", "111", "111",  "111", "111", "111","222", "222", "222", "222", "222", "222", "222"), 
 ordernumber = c("1", "1", "1", "2", "2", "2", "1", "1", "1", "1", "2", "2", "3"), 
 article = c("JeansA", "JeansA", "ShirtA", "JeansA", "JeansA", "JeansB", "ShirtA", "ShirtB", "ShirtB", "JeansA", "JeansB", "ShirtA", "JeansB"), 
 size = c("40", "42", "40", "40", "44", "44", "36", "36", "40", "40", "38", "44", "36"), 
 returned = c("1", "1", "0", "0", "1", "1", "1", "1", "0", "0", "0", "0", "0")
)

输出:

   customernumber ordernumber article size returned
1             111           1  JeansA   40        1
2             111           1  JeansA   42        1
3             111           1  ShirtA   40        0
4             111           2  JeansA   40        0
5             111           2  JeansA   44        1
6             111           2  JeansB   44        1
7             222           1  ShirtA   36        1
8             222           1  ShirtB   36        1
9             222           1  ShirtB   40        0
10            222           1  JeansA   40        0
11            222           2  JeansB   38        0
11            222           2  ShirtA   44        0
12            222           3  JeansB   36        0

结果:

   customernumber ordernumber article size returned changed
1             111           1  JeansA   40        1       0
2             111           1  JeansA   42        1       0
3             111           1  ShirtA   40        0       0
4             111           2  JeansA   40        0       0
5             111           2  JeansA   44        1       1
6             111           2  JeansB   44        1       0   
7             222           1  ShirtA   36        1       0
8             222           1  ShirtB   36        1       0
9             222           1  ShirtB   40        0       0
10            222           1  JeansA   40        0       0
11            222           2  JeansB   38        0       0
11            222           2  ShirtA   44        0       1
12            222           3  JeansB   36        0       0

很抱歉这个混乱,我实际上在我的例子中犯了一个错误并错误填写了更改的变量。如果你还在帮助我,我会非常感激。

谢谢!

2 个答案:

答案 0 :(得分:2)

新答案:

library(data.table) setDT(df) df[, changed := 0 ][df[df, on = .(customernumber, ordernumber < ordernumber, article), nomatch = 0 ][size != i.size & returned == 1, .SD[!i.size %in% size], by = .(customernumber, ordernumber, article) ][, .(customernumber, ordernumber, article, size = i.size)][, unique(.SD)] , on = .(customernumber, ordernumber, article, size), changed := 1][] 的可能解决方案:

    customernumber ordernumber article size returned changed
 1:            111           1  JeansA   40        1       0
 2:            111           1  JeansA   42        1       0
 3:            111           1  ShirtA   40        0       0
 4:            111           2  JeansA   40        0       0
 5:            111           2  JeansA   44        1       1
 6:            111           2  JeansB   44        1       0
 7:            222           1  ShirtA   36        1       0
 8:            222           1  ShirtB   36        1       0
 9:            222           1  ShirtB   40        0       0
10:            222           1  JeansA   40        0       0
11:            222           2  JeansB   38        0       0
12:            222           2  ShirtA   44        0       1
13:            222           3  JeansB   36        0       0

给出:

library(data.table)
setDT(df)

df[df[returned == 0][df[returned == 1]
                     , on = .(customernumber, article)
                     ][ordernumber != i.ordernumber]
   , on = .(customernumber, article, returned)
   , changed := i.returned
   ][, changed := replace(changed, is.na(changed), 0)][]

旧回答:

    customernumber ordernumber article size returned changed
 1:            111           1  JeansA   40        1       0
 2:            111           1  JeansA   42        1       0
 3:            111           1  ShirtA   40        0       0
 4:            111           2  JeansA   42        0       1
 5:            111           2  JeansB   44        1       0
 6:            222           1  ShirtA   36        1       0
 7:            222           1  ShirtB   36        1       0
 8:            222           1  ShirtB   40        0       0
 9:            222           1  JeansA   40        0       0
10:            222           2  JeansB   38        0       0
11:            222           2  ShirtA   44        0       1
12:            222           3  JeansB   36        0       0

给出:

.gallery-container {
position: relative;
}

.gallery-img {
display: block;
width: 100%;
height: auto;
}

.gallery-overlay {  
position: absolute;
top: 0;
bottom: 0;
left: 0;
right: 0;
opacity: 0;
transition: .5s ease;
background-color: rgba(255,255,255,0.5);
display: flex;
align-items: center;
justify-content: center;
font-size: 50px;
}

.gallery-container:hover .gallery-overlay {
opacity: 1;
}

答案 1 :(得分:0)

您正在处理多个滞后条件,因此我们需要多个if ("your_component" != null){ //your code } 命令来创建该条件。然后,我们可以使用lag创建case_when列。

changed