I have a toy example. What is the most efficient way to sum two consecutive rows of y, grouped by x?
library(tibble)
l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0))
df <- as_tibble(l)
df
#> # A tibble: 6 x 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 b 4
#> 3 a 3
#> 4 b 3
#> 5 a 7
#> 6 b 0
So the output would look like this:
group sum seq
a 4 1
a 10 2
b 7 1
b 3 2
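I would like to use the tidyverse and possibly roll_sum() from the RcppRoll package, and have the code work for a variable number of consecutive rows on real-world data with many groups.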
TIA
Answer 0 (score: 7)
One way to do this is to use group_by %>% do, where you can customize the returned data frame inside do:
library(RcppRoll); library(tidyverse)
n = 2
df %>%
group_by(x) %>%
do(
data.frame(
sum = roll_sum(.$y, n),
seq = seq_len(length(.$y) - n + 1)
)
)
# A tibble: 4 x 3
# Groups: x [2]
# x sum seq
# <chr> <dbl> <int>
#1 a 4 1
#2 a 10 2
#3 b 7 1
#4 b 3 2
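To make explicit what roll_sum(.$y, n) and seq compute within each group, here is a package-free sketch in base R. The helper names roll_sum_base and rolling_by_group are hypothetical and are not part of RcppRoll or dplyr; the cumulative-sum trick is just one way to get the same numbers on this toy data, not how RcppRoll is implemented.

```r
# Hypothetical helper: sum of every run of n consecutive values (base R only).
# Uses differences of cumulative sums, which is exact for the integer-valued
# toy data but can accumulate rounding error on general doubles.
roll_sum_base <- function(y, n) {
  if (length(y) < n) return(numeric(0))  # window does not fit: no output
  cs <- c(0, cumsum(y))
  cs[(n + 1):length(cs)] - cs[seq_len(length(y) - n + 1)]
}

# Hypothetical helper: apply the window separately within each group of x.
rolling_by_group <- function(df, n) {
  groups <- split(df$y, df$x)
  pieces <- Map(function(g, y) {
    s <- roll_sum_base(y, n)
    # seq is derived from the result itself, so short groups give zero rows
    data.frame(x = rep(g, length(s)), sum = s, seq = seq_along(s),
               stringsAsFactors = FALSE)
  }, names(groups), groups)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0),
                 stringsAsFactors = FALSE)
res2 <- rolling_by_group(df, 2)
print(res2)
```

Deriving seq with seq_along() on the rolled result, rather than from length(y) - n + 1, also avoids the error that seq_len() raises when that difference is negative, which happens for groups much shorter than the window.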
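Edit: since this is not that efficient, probably because of the overhead of constructing data frames and binding them along the way, here is an improved version (still somewhat slower than data.table, but the gap is much smaller now):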
df %>%
group_by(x) %>%
summarise(sum = list(roll_sum(y, n)), seq = list(seq_len(n() -n + 1))) %>%
unnest()
Timing, using @Matt's data and setup:
library(tibble)
library(dplyr)
library(RcppRoll)
library(stringi) ## Only included for ability to generate random strings
## Generate data with arbitrary number of groups and rows --------------
rowCount <- 100000
groupCount <- 10000
sumRows <- 2L
set.seed(1)
l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE),
y = sample(0:10,rowCount,rep=TRUE))
## Using dplyr and tibble -----------------------------------------------
ptm <- proc.time() ## Start the clock
dplyr_result <- l %>%
group_by(x) %>%
summarise(sum = list(roll_sum(y, n)), seq = list(seq_len(n() -n + 1))) %>%
unnest()
dplyr_time <- proc.time() - ptm ## Stop the clock
## Using data.table instead ----------------------------------------------
library(data.table)
ptm <- proc.time() ## Start the clock
setDT(l) ## Convert l to a data.table
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"),
seq = seq_len(.N)),
keyby = .(x)][!is.na(sum)]
data.table_time <- proc.time() - ptm
The results are:
dplyr_time
# user system elapsed
# 0.688 0.003 0.689
data.table_time
# user system elapsed
# 0.422 0.009 0.430
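Answer 1 (score: 6)
Here is one way. Since you want to sum two consecutive rows, you can use lead() to compute sum. For seq, judging from your expected result, I think you can simply take the row number. After these operations, arrange the data by x (or by x and seq, if necessary). Finally, drop the rows that contain NA. If necessary, you can remove y by adding select(-y) at the end of the code.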
group_by(df, x) %>%
mutate(sum = y + lead(y),
seq = row_number()) %>%
arrange(x) %>%
ungroup %>%
filter(complete.cases(.))
# x y sum seq
# <chr> <dbl> <dbl> <int>
#1 a 1 4 1
#2 a 3 10 2
#3 b 4 7 1
#4 b 3 3 2
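The code above is hard-wired to two rows. As a rough sketch of how the same shift-and-add idea could be extended to a window of n rows, the following base R version repeats the shift n - 1 times. The helpers lead_base and lead_sum are hypothetical stand-ins written for this illustration (lead_base mimics what dplyr's lead() does for a positive offset); grouping is done with ave() instead of group_by().

```r
# Hypothetical stand-in for lead(): shift y up by k >= 1 positions, pad with NA.
lead_base <- function(y, k) c(tail(y, -k), rep(NA, min(k, length(y))))

# Hypothetical helper: y + lead(y, 1) + ... + lead(y, n - 1).
lead_sum <- function(y, n) {
  total <- y
  for (k in seq_len(n - 1)) total <- total + lead_base(y, k)
  total
}

df <- data.frame(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0),
                 stringsAsFactors = FALSE)
n <- 2  # window length; change to sum more consecutive rows

# Compute the windowed sum and the row number within each group of x.
df$sum <- ave(df$y, df$x, FUN = function(y) lead_sum(y, n))
df$seq <- ave(seq_along(df$y), df$x, FUN = seq_along)

# Drop incomplete windows (NA) and order by group, as in the answer.
res <- df[!is.na(df$sum), ]
res <- res[order(res$x, res$seq), ]
print(res)
```

Each extra row in the window costs one more full pass over the vector, so for large n a dedicated rolling function such as roll_sum() is likely the better choice.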
Answer 2 (score: 5)
I noticed you asked for the most efficient way. If you are thinking of scaling this to larger data sets, I would strongly recommend data.table.
library(data.table)
library(RcppRoll)
l[, .(sum = RcppRoll::roll_sum(y, n = 2L, fill = NA, align = "left"),
seq = seq_len(.N)),
keyby = .(x)][!is.na(sum)]
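A rough benchmark of this against the tidyverse answer, using 100,000 rows and 10,000 groups, shows a significant difference.
(I used Psidom's answer rather than jazzurro's, because jazzurro's does not allow summing an arbitrary number of rows.)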
library(tibble)
library(dplyr)
library(RcppRoll)
library(stringi) ## Only included for ability to generate random strings
## Generate data with arbitrary number of groups and rows --------------
rowCount <- 100000
groupCount <- 10000
sumRows <- 2L
set.seed(1)
l <- tibble(x = sample(stri_rand_strings(groupCount,3),rowCount,rep=TRUE),
y = sample(0:10,rowCount,rep=TRUE))
## Using dplyr and tibble -----------------------------------------------
ptm <- proc.time() ## Start the clock
dplyr_result <- l %>%
group_by(x) %>%
do(
data.frame(
sum = roll_sum(.$y, sumRows),
seq = seq_len(length(.$y) - sumRows + 1)
)
)
dplyr_time <- proc.time() - ptm ## Stop the clock
## Using data.table instead ----------------------------------------------
library(data.table)
ptm <- proc.time() ## Start the clock
setDT(l) ## Convert l to a data.table
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"),
seq = seq_len(.N)),
keyby = .(x)][!is.na(sum)]
data.table_time <- proc.time() - ptm ## Stop the clock
Results:
> dplyr_time
user system elapsed
10.28 0.04 10.36
> data.table_time
user system elapsed
0.35 0.02 0.36
> all.equal(dplyr_result,as.tibble(dt_result))
[1] TRUE
Answer 3 (score: 4)
A solution using tidyverse and zoo. This is similar to Psidom's approach.
library(tidyverse)
library(zoo)
df2 <- df %>%
group_by(x) %>%
do(data_frame(x = unique(.$x),
sum = rollapplyr(.$y, width = 2, FUN = sum))) %>%
mutate(seq = 1:n()) %>%
ungroup()
df2
# A tibble: 4 x 3
x sum seq
<chr> <dbl> <int>
1 a 4 1
2 a 10 2
3 b 7 1
4 b 3 2
Answer 4 (score: 1)
zoo + dplyr
library(zoo)
library(dplyr)
df %>%
group_by(x) %>%
mutate(sum = c(NA, rollapply(y, width = 2, sum)),
seq = row_number() - 1) %>%
drop_na()
# A tibble: 4 x 4
# Groups: x [2]
x y sum seq
<chr> <dbl> <dbl> <dbl>
1 a 3 4 1
2 b 3 7 1
3 a 7 10 2
4 b 0 3 2
If the moving window covers only two rows, you can use lag() instead:
df %>%
group_by(x) %>%
mutate(sum = y + lag(y),
seq = row_number() - 1) %>%
drop_na()
# A tibble: 4 x 4
# Groups: x [2]
x y sum seq
<chr> <dbl> <dbl> <dbl>
1 a 3 4 1
2 b 3 7 1
3 a 7 10 2
4 b 0 3 2
Edit:
n = 3 # your moving window
df %>%
group_by(x) %>%
mutate(sum = c(rep(NA, n - 1), rollapply(y, width = n, sum)),
seq = row_number() - n + 1) %>%
drop_na()
Answer 5 (score: 0)
A small variation on the existing answers: first convert the data to list-column format, then use purrr's map() to apply roll_sum() to the data.
l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0))
as_tibble(l) %>%
group_by(x) %>%
summarize(list_y = list(y)) %>%
mutate(rollsum = map(list_y, ~roll_sum(.x, 2))) %>%
select(x, rollsum) %>%
unnest %>%
group_by(x) %>%
mutate(seq = row_number())
I think that if you have the latest version of purrr, you could get rid of the last two lines (the final group_by() and mutate()) by using imap() instead of map().