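I have read all the Q&As on this site about rolling sums, but I cannot follow most of the more complex code, so my ability to adapt it is limited. I tried to implement some of the suggested solutions (here, here and here, among others), but I either get errors or my computer crashes, even when I use only 1,000 rows and 3 columns. So I am clearly getting the code wrong.
My data looks like this (a sample via dput). The full data set has about 100,000 rows.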
structure(list(pnum = c("4778744", "4778744", "4778744", "4832724",
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655",
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877",
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120",
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079",
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166",
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323",
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323",
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700",
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644",
"5805912", "5828869", "5357626"), inventor = c("03859643-1", "04488864-4",
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1",
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4",
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3",
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1",
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3",
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6",
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2",
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2",
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2",
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5",
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1",
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1",
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1",
"04731737-1", "04731737-1", "04731737-1"), pryear = c(1986L, 1986L,
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L,
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L,
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L,
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L,
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")
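Multiple inventors collaborate on a project, identified by pnum, in a particular year, pryear.
Following @Thierry's comment, I changed the data sample to make sure the problem he pointed out is covered.
I am looking for three things:
The number of projects carried out in a window of x years (say 3) before pryear. So if the current project's year is 1977, I want the number of projects carried out from 1974 to 1976, inclusive. If there was no earlier appearance, the result should ideally be "0". The answer provided by @Alex here can be used to achieve the first goal, but, as discussed in the comments, it is not very efficient (especially since my time span runs from 1952 to 2010, with more than 50,000 inventors).
Answer 0 (score: 0)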
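Here is a solution to the first question. You can solve the others as an exercise.
The first solution uses only dplyr. You may run into problems with large data sets, because the self-join builds every pairing of an inventor's rows in memory before the filter is applied.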
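library(dplyr)

df %>%
  # one row per inventor-year, so an inventor with several projects
  # in the same year is not counted several times
  distinct(inventor, pryear) %>%
  # pair each inventor-year with every year in which that inventor had a project
  inner_join(
    df %>%
      select(inventor, oldyear = pryear),
    by = "inventor") %>%
  # keep only projects from the three years before pryear
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
  group_by(inventor, pryear) %>%
  summarise(projects = n())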
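The inner join above drops every inventor-year that has no earlier project, whereas the question asks for a "0" in that case. The following is a minimal sketch of one way to keep those rows, not part of the original answer. The toy data frame `toy`, the variable `window` and the object names `targets`, `counts` and `result` are assumptions made for illustration; only the column names come from the question. Recent dplyr versions may warn about a many-to-many join here, which is expected for this self-join.

```r
library(dplyr)

# Hypothetical toy data using the question's column names.
toy <- data.frame(
  pnum     = c("p1", "p2", "p3", "p4", "p5"),
  inventor = c("a", "a", "a", "b", "a"),
  pryear   = c(1974L, 1976L, 1977L, 1977L, 1977L),
  stringsAsFactors = FALSE
)

window <- 3L  # assumed window length in years

# One row per inventor-year that needs a count.
targets <- distinct(toy, inventor, pryear)

# Count earlier projects inside the window for each inventor-year.
counts <- targets %>%
  inner_join(select(toy, inventor, oldyear = pryear), by = "inventor") %>%
  filter(oldyear >= pryear - window, oldyear < pryear) %>%
  count(inventor, pryear, name = "projects")

# Re-attach inventor-years without any earlier project and report them as 0.
result <- targets %>%
  left_join(counts, by = c("inventor", "pryear")) %>%
  mutate(projects = coalesce(projects, 0L)) %>%
  arrange(inventor, pryear)

print(result)
```

In this toy data, inventor "a" has projects in 1974 and 1976, so the 1977 row counts both of them, while the first appearance of each inventor gets 0.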
The second solution uses dplyr with a database back end. This should be able to handle larger data sets. Note that the code is very similar.
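library(DBI)
library(RSQLite)
library(dplyr)  # the database back end also needs the dbplyr package installed

conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
project <- tbl(conn, "project")  # lazy reference; no rows are loaded yet

project %>%
  distinct(inventor, pryear) %>%
  inner_join(
    project %>%
      select(inventor, oldyear = pryear),
    by = "inventor") %>%
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
  group_by(inventor, pryear) %>%
  summarise(projects = n()) %>%
  collect()  # run the query in SQLite and fetch only the summary

dbDisconnect(conn)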
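If the join itself is the bottleneck, one possible alternative (a different technique from the two solutions above, offered only as a sketch) is to aggregate to one row per inventor and year first and then sum over the window within each inventor. With a time span of 1952 to 2010, each inventor then has at most a few dozen rows to scan. The toy data and the names `toy`, `window`, `per_year`, `n_projects` and `result2` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data using the question's column names.
toy <- data.frame(
  pnum     = c("p1", "p2", "p3", "p4", "p5"),
  inventor = c("a", "a", "a", "b", "a"),
  pryear   = c(1974L, 1976L, 1977L, 1977L, 1977L),
  stringsAsFactors = FALSE
)

window <- 3L  # assumed window length in years

# Step 1: collapse to the number of projects per inventor and year.
per_year <- count(toy, inventor, pryear, name = "n_projects")

# Step 2: within each inventor, add up the yearly counts that fall into
# the window [pryear - window, pryear - 1]. An empty window sums to 0.
result2 <- per_year %>%
  group_by(inventor) %>%
  mutate(
    projects = vapply(
      pryear,
      function(y) sum(n_projects[pryear >= y - window & pryear < y]),
      integer(1)
    )
  ) %>%
  ungroup()

print(result2)
```

Because the aggregation happens before any windowing, no row pairs are materialised, and inventor-years without earlier projects naturally come out as 0.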