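Expand ranges defined by "from" and "to" columns

Time: 2012-07-15 18:30:16

Tags: r dataframe

My data frame contains the "name" of each US president and the years their terms started and ended (the "from" and "to" columns). Here is a sample: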

name           from  to
Bill Clinton   1993 2001
George W. Bush 2001 2009
Barack Obama   2009 2012

...and the output from dput:

dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name", 
"from", "to"), row.names = 42:44, class = "data.frame")

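I want to create a data frame with two columns ("name" and "year"), with one row for each year a president was in office. So I need to create a regular sequence of years running from "from" to "to" for each president. This is my expected result: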

name           year
Bill Clinton   1993
Bill Clinton   1994
...
Bill Clinton   2000
Bill Clinton   2001
George W. Bush 2001
George W. Bush 2002
... 
George W. Bush 2008
George W. Bush 2009
Barack Obama   2009
Barack Obama   2010
Barack Obama   2011
Barack Obama   2012

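I know I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate over every president.

How do I do this? I feel I should know this, but I'm drawing a blank.

Update 1

OK, I've tried both solutions, but I get an error message: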

foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
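
The error comes from the grouping: Grover Cleveland appears twice in foo, so a per-name group hands seq() 'from' and 'to' vectors of length 2. The following is a minimal base-R sketch, assuming the same foo data as above (rebuilt here with data.frame() for readability; rows_per_name, years and expanded are illustrative names), that shows the group sizes and expands one row at a time instead:

```r
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897),
  stringsAsFactors = FALSE
)

# Rows per name: Cleveland has two rows, so a per-name group
# contains two 'from' values and two 'to' values
rows_per_name <- table(foo$name)
print(rows_per_name)

# Expand one row at a time so seq() always receives length-1 arguments
years <- Map(seq, foo$from, foo$to)
expanded <- data.frame(
  name = rep(foo$name, lengths(years)),
  year = unlist(years)
)
print(expanded)
```

Each term spans five calendar years, so the three rows expand to fifteen.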

9 Answers:

Answer 0: (score: 13)

Here is a data.table solution. It has the nice (if minor) feature of leaving the presidents in the order in which they were supplied:

library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
#               name year
#  1:   Bill Clinton 1993
#  2:   Bill Clinton 1994
#  ...
#  ...
# 21:   Barack Obama 2011
# 22:   Barack Obama 2012

Edit: To handle presidents with non-consecutive terms, use this instead:

dt[, list(year = seq(from, to)), by = c("name", "from")]
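
To see why the extra grouping column helps, here is a small sketch, assuming the data.table package is installed and using the Grover Cleveland data from the question's update. Grouping by both "name" and "from" makes every group a single row, so seq() receives length-1 arguments. The names foo_dt and res are illustrative only.

```r
library(data.table)

foo_dt <- data.table(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897)
)

# Grouping by name alone would put both Cleveland rows in one group;
# adding "from" to the grouping makes each group exactly one row
res <- foo_dt[, list(year = seq(from, to)), by = c("name", "from")]

# The grouping column "from" is carried into the result; drop it
res[, from := NULL]
print(res)
```

This relies on each (name, from) pair being unique, which holds for this data.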

Answer 1: (score: 12)

You can use the plyr package:

library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
#              name year
# 1    Barack Obama 2009
# 2    Barack Obama 2010
# 3    Barack Obama 2011
# 4    Barack Obama 2012
# 5    Bill Clinton 1993
# 6    Bill Clinton 1994
# [...]

If it is important that the data be sorted by year, you can use the arrange function:

df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
#              name year
# 1    Bill Clinton 1993
# 2    Bill Clinton 1994
# 3    Bill Clinton 1995
# [...]
# 21   Barack Obama 2011
# 22   Barack Obama 2012

Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use adply, which accounts for presidents with non-consecutive terms:

adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]

Answer 2: (score: 5)

Here is a dplyr solution:

library(dplyr)

# the data
presidents <- 
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name", 
"from", "to"), row.names = 42:44, class = "data.frame")

# the expansion of the table
presidents %>%
    rowwise() %>%
    do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))

# the output
Source: local data frame [22 x 2]
Groups: <by row>

             name  year
            (chr) (dbl)
1    Bill Clinton  1993
2    Bill Clinton  1994
3    Bill Clinton  1995
4    Bill Clinton  1996
5    Bill Clinton  1997
6    Bill Clinton  1998
7    Bill Clinton  1999
8    Bill Clinton  2000
9    Bill Clinton  2001
10 George W. Bush  2001
..            ...   ...

h/t: https://stackoverflow.com/a/24804470/1036500

Answer 3: (score: 2)

Another base solution:

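# d is the data frame of presidents (columns name, from, to)
# SIMPLIFY = FALSE keeps a list even when every range has the same length
l <- mapply(`:`, d$from, d$to, SIMPLIFY = FALSE)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))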
#              name year
# 1    Bill Clinton 1993
# 2    Bill Clinton 1994
# ...snip
# 8    Bill Clinton 2000
# 9    Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19   Barack Obama 2009
# 20   Barack Obama 2010
# 21   Barack Obama 2011
# 22   Barack Obama 2012
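
The same idea can be broken into its intermediate steps. This is only an illustrative sketch, assuming d holds the three presidents from the question; the helper names n and idx are not part of the original answer.

```r
d <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

# Step 1: one vector of years per row (Map always returns a list)
l <- Map(`:`, d$from, d$to)

# Step 2: how many years each row expands to
n <- lengths(l)

# Step 3: repeat each row index once per year in its range
idx <- rep(seq_len(nrow(d)), n)

# Step 4: look up the names by index and flatten the years
out <- data.frame(name = d$name[idx], year = unlist(l))
print(out)
```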

Answer 4: (score: 1)

Here is a quick base-R solution, where Df is your data.frame:

do.call(rbind, apply(Df, 1, function(x) {
  data.frame(name=x[1], year=seq(x[2], x[3]))}))

It gives some warnings about row names, but it seems to return the correct data.frame.
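
The warnings most likely arise because apply() converts the data frame to a character matrix first, so each x is a named character vector. A possible variant, sketched here under the assumption that Df has the columns name, from and to, loops over row indices instead and keeps the original column types (pieces and result are illustrative names):

```r
Df <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

# Build one small data frame per row, indexing the columns directly
# so that from and to stay numeric
pieces <- lapply(seq_len(nrow(Df)), function(i) {
  data.frame(name = Df$name[i], year = seq(Df$from[i], Df$to[i]))
})

# Stack the pieces into a single data frame
result <- do.call(rbind, pieces)
print(result)
```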

Answer 5: (score: 0)

Another option using tidyverse is to gather the data into long format, group_by name, and create a sequence between the from and to dates.

library(tidyverse)

presidents %>%
  gather(key, date, -name) %>%
  group_by(name) %>%
  complete(date = seq(date[1], date[2]))%>%
  select(-key) 

# A tibble: 22 x 2
# Groups:   name [3]
#   name          date
#   <chr>        <dbl>
# 1 Barack Obama  2009
# 2 Barack Obama  2010
# 3 Barack Obama  2011
# 4 Barack Obama  2012
# 5 Bill Clinton  1993
# 6 Bill Clinton  1994
# 7 Bill Clinton  1995
# 8 Bill Clinton  1996
# 9 Bill Clinton  1997
#10 Bill Clinton  1998
# … with 12 more rows

Answer 6: (score: 0)

Another tidyverse approach, using unnest and map2.

library(tidyverse)

presidents %>%
  unnest(year = map2(from, to, seq)) %>%
  select(-from, -to)

#              name  year
# 1    Bill Clinton  1993
# 2    Bill Clinton  1994
# ...
# 21   Barack Obama  2011
# 22   Barack Obama  2012
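
In tidyr 1.0.0 and later, defining a new column inside unnest(), as in unnest(year = ...) above, is deprecated. A sketch of the same approach in the newer style, assuming dplyr, tidyr and purrr are installed, first builds a list-column with mutate() and then unnests it:

```r
library(dplyr)
library(tidyr)
library(purrr)

presidents <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

expanded <- presidents %>%
  mutate(year = map2(from, to, seq)) %>%  # list-column: one vector of years per row
  unnest(year) %>%                        # one output row per year
  select(name, year)
print(expanded)
```

Because map2() works row by row, this form should also cope with presidents who served non-consecutive terms.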

Answer 7: (score: 0)

Use by to create a by list L of data frames, one data frame per president, and then rbind them together. No packages are used.

L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))

If you don't mind the row names, the last line can be simplified to:

do.call("rbind", L)
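
To compare the two variants, the following sketch (assuming the three-row presidents data from the question; with_names and without_names are illustrative names) runs both and prints the first few row names of each. When the list keeps its names, rbind typically prefixes each row name with the president's name.

```r
presidents <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

# One data frame per president, stored in a "by" list
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))

# Keeping the list names: row names carry the president's name
with_names <- do.call("rbind", L)

# Dropping the list names first: plain sequential row names
without_names <- do.call("rbind", setNames(L, NULL))

print(head(rownames(with_names), 3))
print(head(rownames(without_names), 3))
```

Note that by() groups by name, so the rows come out ordered by name rather than in the original order.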

Answer 8: (score: 0)

Another solution using dplyr and tidyr:

library(magrittr) # for pipes
df <- data.frame(tata = c('toto1', 'toto2'), from = c(2000, 2004), to = c(2001, 2009))

#    tata from   to
# 1 toto1 2000 2001
# 2 toto2 2004 2009

df %>% 
  dplyr::as_tibble() %>%
  dplyr::rowwise() %>%
  dplyr::mutate(combined = list(seq(from, to))) %>%
  dplyr::select(-from, -to) %>%
  tidyr::unnest(combined)

#   tata  combined
#   <fct>    <int>
# 1 toto1     2000
# 2 toto1     2001
# 3 toto2     2004
# 4 toto2     2005
# 5 toto2     2006
# 6 toto2     2007
# 7 toto2     2008
# 8 toto2     2009