我正在开展一个历史项目,在那里我们需要了解每个月的员工数量。对于数据集中的每个人,我都有他们被雇用的时期。 fname_code代码用于不同的工作职位。举例来说,Edmont Privat博士在以下各个时期有两个不同的职能:
pname fname_code begin_date end_date
1 Dr. Edmond Privat 3 1921-09-02 1921-10-07
2 Dr. Edmond Privat 2 1921-12-07 1922-03-06
3 Joseph Louis Marie Charles Avenol 1 1923-02-01 1933-07-01
4 Joseph Louis Marie Charles Avenol 1 1933-07-01 1940-08-31
5 Dr. G. G. Kullmann 2 1931-03-30 1938-12-15
我的想法是将此信息转换为主题/日期时段数据框/矩阵,其中N表示该人此时不在公司,而数字表示他们已被雇用并且他们拥有排名。这是我想到的一个例子:
1944-07-01 1944-08-01 1944-09-01 1944-10-01
Albert Dan Meurig Evans N N N N
Genevieve Jeanne Leonie Mayor N 2 3 3
我做了一件可以完成这项工作的东西 - 它已经完成了上述工作 - 然而,代码并不适用于虚假的程序员(许多foreloops!)。我的问题是,如果你们有经验的程序员有建议提高速度,或以完全不同的方式获得我的目标。 我尝试过使用一些dplyr函数,但我对它们的经验太少,无法使它们工作。我还考虑创建一个if条件来处理一个人刚被雇用一段时间的情况,因为在这种情况下不需要forloop - 但我不确定在哪里以最佳方式实施它。 / p>
我的计算灾难的逻辑是查看数据框和就业数据框之间的月间隔是否存在重叠:
library(lubridate)
library(tidyverse)
#creating sequence of dates for columns
start_date <- as.Date("1919-01-01")
end_date <- as.Date("1948-12-30")
dates <- seq.Date(start_date, end_date, by ="month")
#dates as columns and names on columns
test.df <- matrix(ncol =length(dates), nrow = nlevels(mdl_df$pname))
test.df <- as.data.frame(test.df)
colnames(test.df) <- dates
rownames(test.df) <- levels(mdl_df$pname)
for (name in 1:nlevels(mdl_df$pname)){
#subsetting the rows for each person
person_rows <- mdl_df %>% filter( mdl_df$pname == rownames(test.df)[name])
for (date in 1:(length(dates)-1)) {
#Creating a month interval consisting of the time between two adjecent months
interval1 <- interval(ymd(colnames(test.df)[date]),ymd(colnames(test.df)[date+1]))
for (row in 1:nrow(person_rows)) {
#check if overlap between df month interval and employment intervals.
interval2 <- interval(ymd(person_rows$begin_date[row]),ymd(person_rows$end_date[row]))
if (int_overlaps(interval1, interval2)){
#checking if df period and work period overlap. If so rank is inserted otherwise N is entered
test.df[name,date] <- test_rows$fname_code[row]
break
}else{
test.df[name,date] <- "N"
}
}
}
}
该数据集由大约3000名员工组成,我的计算机需要大约6-7个小时才能完成这项工作。我需要在接下来的几周内在各种数据集上多次运行并重新运行该脚本,所以非常感谢任何帮助!
编辑:前50行数据集的输出输出。
> dput(droplevels(head(mdl_df, 50)))
structure(list(pname = structure(c(7L, 7L, 24L, 24L, 8L, 19L,
16L, 16L, 16L, 4L, 34L, 11L, 17L, 12L, 23L, 10L, 14L, 14L, 14L,
14L, 14L, 32L, 5L, 22L, 29L, 3L, 13L, 25L, 2L, 6L, 26L, 18L,
21L, 27L, 27L, 28L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 15L, 31L,
33L, 30L, 30L, 1L), .Label = c("A. Gordon Bagnall", "Bertil Gotthard Ohlin",
"Birgit Nissen", "Bryan Fullerton Adams", "C.H. Wykes", "Christian Olsen",
"Dr. Edmond Privat", "Dr. G. G. Kullmann", "Eugène Henri René Vigier",
"Ewan P. Wallis-Jones", "Francis Yeats-Brown", "Francisco Walker-Linares",
"Frank Horsfall Nixon", "Frank Paul Walters", "Franklin Urteaga",
"Gerald Heguerty Furtado Abraham", "Gladys Wade", "Guillaume Théodore Conrad Zwerner",
"Henri Bonnet", "Haakon Vigander", "Ignacio J. Valdes", "Ingvad Nielsen",
"Jessie Irene Wall", "Joseph Louis Marie Charles Avenol", "Julian Nogueira",
"Konni Zilliacus", "Luis Varela-Obregoso", "Marc Veillet-Lavallee",
"Maria Nielsen", "Peter Martin Anker", "Pierre Achille Louis Eugène Quesnay",
"Pierre Henry Watier", "Prof. Fred Alexander", "Robert André Felix Bach"
), class = "factor"), fname_code = c(3L, 2L, 1L, 1L, 2L, 2L,
2L, 0L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 6L,
0L, 0L, 1L, 2L, 3L, 6L, 2L, 2L, 2L, 2L, 2L, 2L), begin_date = structure(c(-17653,
-17557, -17136, -13333, -14157, -17897, -18050, -13789, -8962,
-15010, -11810, -15372, -14003, -14855, -16047, -12900, -18494,
-18254, -14245, -13333, -11172, -12008, -18398, -14360, -15002,
-11802, -17883, -12862, -14245, -17136, -18248, -14975, -13989,
-15494, -15372, -14108, -14738, -18201, -17849, -17849, -11657,
-10592, -10579, -10130, -11436, -16849, -13631, -14033, -11161,
-12620), class = "Date"), end_date = structure(c(-17618, -17468,
-13333, -10715, -11340, -14243, -13789, -11223, -8624, -11178,
-10797, -17543, -13982, -8555, -15628, -12879, -18254, -14245,
-13333, -11172, -10809, -11822, -18255, -14339, -14988, -11781,
-17078, -11158, -13958, -16590, -11401, -14610, -13968, -15434,
-15007, -13920, -14717, -17849, -8524, -8524, -8524, -8524, -8524,
-8524, -11415, -15707, -13613, -11161, -8555, -12614), class = "Date")),
.Names = c("pname", "fname_code", "begin_date", "end_date"), row.names = c(NA, 50L), class = "data.frame")
运行Rstudio v.1.0.136
附件包:
[1] dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1
[7] tidyverse_1.1.1 lubridate_1.6.0
答案 0 :(得分:3)
data.table
软件包的1.9.8版(在2016年11月25日CRAN上)引入了非equi连接,这些非常有用,可以找到工作时间与月间隔的重叠。然后使用dcast()
从长格式转换为宽格式。
library(data.table)
# coerce to data.table
setDT(mdl_df)[
# right join with sequence of monthly intervals
.(mseq = seq(as.Date("1944-01-01"), length.out = 4L, by = "1 month")),
# using non-equi join conditions
on = .(begin_date <= mseq, end_date >= mseq)][
# reshape from wide to long format,
# show rank (concatenate in case of multiple ranks)
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
pname 1944-01-01 1944-02-01 1944-03-01 1944-04-01 1: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 2: Francisco Walker-Linares 4 4 4 4 3: Peter Martin Anker 2 2 2 2
In his comment,OP要求涵盖1919-01-01至1948-12-30期间。在这里,我们需要修改连接参数:
result <- setDT(mdl_df)[
.(mseq = seq(as.Date("1919-01-01"), as.Date("1948-12-30"), by = "1 month")),
on = .(begin_date <= mseq, end_date >= mseq), nomatch = 0L, allow.cartesian = TRUE][
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
result
由27行和328列组成,只能以部分打印:
result[, 1:5]
pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01 1: Bertil Gotthard Ohlin 2: Bryan Fullerton Adams 3: C.H. Wykes 2 4: Christian Olsen 5: Dr. Edmond Privat 6: Dr. G. G. Kullmann 7: Eugène Henri René Vigier 8: Francisco Walker-Linares 9: Frank Horsfall Nixon 10: Frank Paul Walters 3 3 3 3 11: Franklin Urteaga 12: Gerald Heguerty Furtado Abraham 13: Gladys Wade 14: Guillaume Théodore Conrad Zwerner 15: Henri Bonnet 16: Haakon Vigander 17: Ignacio J. Valdes 18: Jessie Irene Wall 19: Joseph Louis Marie Charles Avenol 20: Julian Nogueira 21: Konni Zilliacus 22: Luis Varela-Obregoso 23: Marc Veillet-Lavallee 24: Peter Martin Anker 25: Pierre Achille Louis Eugène Quesnay 26: Pierre Henry Watier 27: Robert André Felix Bach pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01
请注意,显示的第一个日期是1919-06-01,因为没有先前的匹配项。同样,最后一栏328是1946-08-01。
result[, c(1, 328 - 2:0)]
pname 1946-06-01 1946-07-01 1946-08-01 1: Bertil Gotthard Ohlin 2: Bryan Fullerton Adams 3: C.H. Wykes 4: Christian Olsen 5: Dr. Edmond Privat 6: Dr. G. G. Kullmann 7: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 8: Francisco Walker-Linares 4 4 9: Frank Horsfall Nixon 10: Frank Paul Walters 11: Franklin Urteaga 12: Gerald Heguerty Furtado Abraham 13: Gladys Wade 14: Guillaume Théodore Conrad Zwerner 15: Henri Bonnet 16: Haakon Vigander 17: Ignacio J. Valdes 18: Jessie Irene Wall 19: Joseph Louis Marie Charles Avenol 20: Julian Nogueira 21: Konni Zilliacus 22: Luis Varela-Obregoso 23: Marc Veillet-Lavallee 24: Peter Martin Anker 2 2 25: Pierre Achille Louis Eugène Quesnay 26: Pierre Henry Watier 27: Robert André Felix Bach pname 1946-06-01 1946-07-01 1946-08-01
请注意,mdl_df
有mdl_df[, uniqueN(pname)]
个唯一名称,而result
只有27个。这是由于数据不一致,即begin_date
稍后end_date
或太短的时间:
# anti-join
mdl_df[!result, on = "pname"]
pname fname_code begin_date end_date 1: Francis Yeats-Brown 2 1927-12-01 1921-12-21 2: Ewan P. Wallis-Jones 2 1934-09-07 1934-09-28 3: Ingvad Nielsen 2 1930-09-08 1930-09-29 4: Maria Nielsen 2 1928-12-05 1928-12-19 5: Birgit Nissen 2 1937-09-09 1937-09-30 6: Prof. Fred Alexander 2 1932-09-06 1932-09-24 7: A. Gordon Bagnall 2 1935-06-14 1935-06-20