假设我有数据框:
df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))
其中YearFrom是例如公司成立,YearTo是取消的年份。 如果YearTo是NA,那么它仍在工作。
我想计算每年的公司数量。
表格应如下所示
City |"Year" |"Count"
"NY" |2001 1
"NY" |2002 2
"NY" |2003 3
"NY" |2004 3
"NY" |2005 2
"NY" |2006 3
"NY" |2007 3
"NY" |2008 4
"NY" |2009 3
"LA" |2001 0
"LA" |2002 1
"LA" |2003 1
"LA" |2004 2
"LA" |2005 4
"LA" |2006 4
"LA" |2007 4
"LA" |2008 2
"LA" |2009 2
我想通过dplyr或datatable包来解决这个问题,但我无法弄清楚如何解决这个问题?
答案 0 :(得分:8)
首先,清理数据......
curr_year = as.integer(year(Sys.Date()))
library(data.table)
setDT(df)
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]
然后,非等于加入:
df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE,
.N
, by=.EACHI][, .(City, Year = YearFrom, N)]
City Year N
1: LA 2001 0
2: LA 2002 1
3: LA 2003 1
4: LA 2004 2
5: LA 2005 4
6: LA 2006 4
7: LA 2007 4
8: LA 2008 3
9: LA 2009 3
10: NY 2001 1
11: NY 2002 2
12: NY 2003 3
13: NY 2004 3
14: NY 2005 2
15: NY 2006 3
16: NY 2007 3
17: NY 2008 4
18: NY 2009 3
答案 1 :(得分:7)
较短的tidyverse
解决方案。
# Firsts some data prep
df <- mutate(df,
YearFrom = as.numeric(as.character(YearFrom)), #Fix year coding
YearTo = as.numeric(as.character(YearTo)),
YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) #Replace NA with max
df %>%
mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>% #Find all years
unnest() %>% #Spread over rows
count(Years, City) %>% #Count them
complete(City, Years, fill = list(n = 0)) #Add in zeros, if needed
答案 2 :(得分:5)
以下是使用data.table
的一个答案。数据准备工作在最底层。
# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])
# combine to a single data.table
dfNew <- rbindlist(cityList)
# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]
写在一行,这是
# get the counts
rbindlist(lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
by=.(City, Year)]
在这里,lapply
遍历每一行并构造一个data.table,其中重复的城市值为一列,第二列为操作年份。在这里,YearTo递减,因此它不包括关闭年份。请注意,在数据准备中,缺失值设置为2018,以便包含当前年份。
lapply
返回一个data.tables列表,它通过rbindlist
合并为一个data.table。此data.table汇总到城市 - 年对,并使用.N
构建计数。
这些回归
City Year Count
1: NY 2001 1
2: NY 2002 2
3: NY 2003 3
4: NY 2004 3
5: NY 2005 2
6: NY 2006 3
7: NY 2007 3
...
26: LA 2012 3
27: LA 2013 3
28: LA 2014 3
29: LA 2015 3
30: LA 2016 3
31: LA 2017 3
32: LA 2002 1
33: LA 2003 1
数据强>
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) :=
lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count)
df[is.na(YearTo), YearTo := 2018]
答案 3 :(得分:2)
此解决方案使用dplyr
和tidyr
。
library(dplyr)
library(tidyr)
df %>%
# Change YearFrom and YearTo to numeric
mutate(YearFrom = as.numeric(as.character(YearFrom)),
YearTo = as.numeric(as.character(YearTo))) %>%
# Replace NA with 2017 in YearTo
mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>%
# All number in YearTo minus 1 to exclude the year of cancellation
mutate(YearTo = YearTo - 1) %>%
# Group by row
rowwise() %>%
# Create a tbl for each row, expand the Year column based on YearFrom and YearTo
do(data_frame(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>%
ungroup() %>%
# Count the number of each City and Year
count(City, Year) %>%
# Rename the column n to Count
rename(Count = n) %>%
# Spread the data frame to find the implicity missing value in LA, 2001
spread(Year, Count) %>%
# Gather the data frame to account for the missing value in LA, 2001
gather(Year, Count, - City) %>%
# Replace NA with 0 in Count
mutate(Count = ifelse(is.na(Count), 0L, Count)) %>%
# Arrange the data
arrange(desc(City), Year) %>%
# Filter the data until 2009
filter(Year <= 2009)