Add a corresponding season column to a data frame

Time: 2017-04-14 17:11:31

Tags: r date time

Below is a sample of my data frame. I am working in R.

date          name       count
2016-11-12    Joe         5
2016-11-15    Bob         5
2016-06-15    Nick        12
2016-10-16    Cate        6

I want to add a column to my data frame that tells me the season corresponding to each date. I would like it to look like this:

date          name       count      Season
2016-11-12    Joe         5          Winter
2016-11-15    Bob         5          Winter
2017-06-15    Nick        12         Summer
2017-10-16    Cate        6          Fall 

I have started on some code:

startWinter <- c(month.name[1], month.name[12], month.name[11])
startSummer <- c(month.name[5], month.name[6], month.name[7])
startSpring <- c(month.name[2], month.name[3], month.name[4])

# create a function to find the correct season based on the month
MonthSeason <- function(Month) {
  # !is.na()
 # ignores values with NA
  # match()
  # returns a vector of the positions of matches 
  # If the starting month matches a spring season, print "Spring". If the starting month matches a summer season, print "Summer" etc.  
  ifelse(!is.na(match(Month, startSpring)),
         return("spring"),
         return(ifelse(!is.na(match(Month, startWinter)),
                       "winter",
                       ifelse(!is.na(match(Month, startSummer)),
                              "summer","fall"))))
}

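This code gives me the season for a single month. I'm not sure whether I'm approaching the problem the right way. Can anyone help? Thanks!

3 Answers:

Answer 0: (score: 2)

There are a couple of hacks, and their usability depends on whether you want meteorological or astronomical seasons. I'll offer both; I think they provide sufficient flexibility.

I'm going to use the second data set you provided, since it covers more than just "Winter".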

txt <- "date          name       count
2016-11-12    Joe         5
2016-11-15    Bob         5
2017-06-15    Nick        12
2017-10-16    Cate        6"
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

The quick method works well when seasons are defined strictly by month.

metseasons <- c(
  "01" = "Winter", "02" = "Winter",
  "03" = "Spring", "04" = "Spring", "05" = "Spring",
  "06" = "Summer", "07" = "Summer", "08" = "Summer",
  "09" = "Fall", "10" = "Fall", "11" = "Fall",
  "12" = "Winter"
)
metseasons[format(dat$date, "%m")]
#       11       11       06       10 
#   "Fall"   "Fall" "Summer"   "Fall" 
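To get the column the question asks for, the lookup result can be assigned straight into the data frame. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the column name Season is taken from the question, and unname() only drops the month names ("11", "06", ...) that the lookup carries along. Note that the question's own grouping counts November as winter; changing the labels in the lookup vector accommodates that.

```r
# Self-contained sketch: rebuild the sample data, then add a Season column.
dat <- data.frame(
  date  = as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16")),
  name  = c("Joe", "Bob", "Nick", "Cate"),
  count = c(5, 5, 12, 6),
  stringsAsFactors = FALSE
)

# Meteorological seasons keyed by two-digit month.
metseasons <- c(
  "01" = "Winter", "02" = "Winter",
  "03" = "Spring", "04" = "Spring", "05" = "Spring",
  "06" = "Summer", "07" = "Summer", "08" = "Summer",
  "09" = "Fall", "10" = "Fall", "11" = "Fall",
  "12" = "Winter"
)

# format(, "%m") yields "01".."12"; unname() strips the names from the result.
dat$Season <- unname(metseasons[format(dat$date, "%m")])
dat
```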

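If you choose to use date ranges for seasons that do not start and stop on month boundaries, such as the astronomical seasons, then here is another "hack":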

astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))
astroseasons_labels <- c("Winter", "Spring", "Summer", "Fall", "Winter")

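If you use proper Date or POSIX types, then you are including years, which makes things a little less generic. One might think of using julian dates, but those produce anomalies during leap years. So, assuming that February 28 is never a seasonal boundary, I am "numericizing" the month-day. Even though R does character comparisons just fine, cut needs numbers, so we convert them to integers.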
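The leap-year anomaly is easy to see with the day of the year ("%j" in format): the same calendar date maps to different day numbers in leap and non-leap years, while the month-day integer stays put. A small illustration, with the dates chosen only as examples:

```r
d <- as.Date(c("2016-03-20", "2017-03-20"))  # 2016 is a leap year

# After February, the day of the year is shifted by one in a leap year.
doy <- as.integer(format(d, "%j"))
doy
# [1] 80 79

# The "numericized" month-day is identical in both years.
md <- as.integer(format(d, "%m%d"))
md
# [1] 320 320
```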

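Two safeguards: because cut is either right-open (and left-closed) or right-closed (and left-open), our two book-ends need to extend beyond the legal dates, ergo "0000" and "1232". Other techniques could work equally well here (e.g., using -Inf and Inf after the conversion to integer).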

astroseasons_labels[ cut(as.integer(format(dat$date, "%m%d")), astroseasons, labels = FALSE) ]
# [1] "Fall"   "Fall"   "Spring" "Fall"  

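Notice that the third date falls in Spring when using astronomical seasons and in Summer otherwise.

This solution can easily be adjusted to account for the Southern hemisphere or other seasonal preferences/beliefs.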
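The -Inf/Inf alternative mentioned above might look like the following sketch. The names astrobreaks, dates and md are introduced here only for illustration; the inner boundaries are the same as in astroseasons.

```r
dates <- as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16"))

# Same inner boundaries, but open-ended book-ends instead of 0000 and 1232.
astrobreaks <- c(-Inf, 320, 620, 922, 1221, Inf)
astroseasons_labels <- c("Winter", "Spring", "Summer", "Fall", "Winter")

md <- as.integer(format(dates, "%m%d"))

# cut() defaults to right-closed intervals, so 0320 itself still counts as Winter.
res <- astroseasons_labels[cut(md, astrobreaks, labels = FALSE)]
res
# [1] "Fall"   "Fall"   "Spring" "Fall"
```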
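As one possible adjustment, a Southern-hemisphere variant might keep the same boundaries and only swap the labels. This is a sketch under the assumption that the southern seasons are the mirror image of the northern ones; the name astro_labels_south and the sample dates are hypothetical.

```r
astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))

# Mirror-image labels: northern Winter becomes Summer, Spring becomes Fall, etc.
astro_labels_south <- c("Summer", "Fall", "Winter", "Spring", "Summer")

dates <- as.Date(c("2016-11-12", "2017-06-15", "2017-01-05"))
south <- astro_labels_south[
  cut(as.integer(format(dates, "%m%d")), astroseasons, labels = FALSE)
]
south
# [1] "Spring" "Fall"   "Summer"
```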

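Edit: motivated by @Kristofersen's answer (thanks), I looked into benchmarks. lubridate::month uses a POSIXct-to-POSIXlt conversion to extract the month, which can be over 10x faster than my format(x, "%m") method. As such: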

metseasons2 <- c(
  "Winter", "Winter",
  "Spring", "Spring", "Spring",
  "Summer", "Summer", "Summer",
  "Fall", "Fall", "Fall",
  "Winter"
)

Note that as.POSIXlt returns 0-based months, so we add 1:

metseasons2[ 1 + as.POSIXlt(dat$date)$mon ]
# [1] "Fall"   "Fall"   "Summer" "Fall"  

Comparison:

library(lubridate)
library(microbenchmark)
set.seed(42)
x <- Sys.Date() + sample(1e3)
xlt <- as.POSIXlt(x)

microbenchmark(
  metfmt = metseasons[ format(x, "%m") ],
  metlt  = metseasons2[ 1 + xlt$mon ],
  astrofmt = astroseasons_labels[ cut(as.integer(format(x, "%m%d")), astroseasons, labels = FALSE) ],
  astrolt  = astroseasons_labels[ cut(100*(1+xlt$mon) + xlt$mday, astroseasons, labels = FALSE) ],
  lubridate = sapply(month(x), seasons)  # seasons() is the function from @Kristofersen's answer
)
# Unit: microseconds
#       expr      min       lq       mean    median        uq       max neval
#     metfmt 1952.091 2135.157 2289.63943 2212.1025 2308.1945  3748.832   100
#      metlt   14.223   16.411   22.51550   20.0575   24.7980    68.924   100
#   astrofmt 2240.547 2454.245 2622.73109 2507.8520 2674.5080  3923.874   100
#    astrolt   42.303   54.702   72.98619   66.1885   89.7095   163.373   100
#  lubridate 5906.963 6473.298 7018.11535 6783.2700 7508.0565 11474.050   100

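So the methods using as.POSIXlt(...)$mon are significantly faster. (@Kristofersen's answer could be improved by vectorizing it, perhaps with ifelse, but that still would not match the speed of the vector lookups, with or without cut.)

Answer 1: (score: 1)

If your data is df: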
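A vectorized version of that function, along the lines suggested, might look like this. It is a sketch, not the other answer's code: the name seasons_vec is made up here, and it takes the month from base R's as.POSIXlt so that it runs without lubridate.

```r
# Vectorized mapping from month number (1-12) to the question's season grouping.
# Caveat: anything not matched earlier, including NA, falls through to "Winter".
seasons_vec <- function(m) {
  ifelse(m %in% 2:4, "Spring",
  ifelse(m %in% 5:7, "Summer",
  ifelse(m %in% 8:10, "Fall", "Winter")))
}

dates <- as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16"))
months <- 1 + as.POSIXlt(dates)$mon   # $mon is 0-based
out <- seasons_vec(months)
out
# [1] "Winter" "Winter" "Summer" "Fall"
```

Because the whole vector is handled in one call, no sapply loop is needed.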

# create dataframe for month and corresponding season
dfSeason <- data.frame(
  season = c(rep("Winter", 3), rep("Summer", 3), rep("Spring", 3), rep("Fall", 3)),
  month = month.name[c(11, 12, 1, 5:7, 2:4, 8:10)],
  stringsAsFactors = FALSE
)

# convert the date column to class Date
df$date <- as.Date(df$date)

# match the month of the date in df (format %B) with month in season
# then use it to index the season of dfSeason
df$season <- dfSeason$season[match(format(df$date, "%B"), dfSeason$month)]
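One caveat: format(x, "%B") returns month names in the current locale, whereas month.name is always English, so the match can come back as NA on a non-English system. A sketch of a locale-independent variant matches on month numbers instead. The name season_by_month is introduced here for illustration, and the grouping follows the question (November to January as winter).

```r
df <- data.frame(
  date  = c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16"),
  name  = c("Joe", "Bob", "Nick", "Cate"),
  count = c(5, 5, 12, 6),
  stringsAsFactors = FALSE
)
df$date <- as.Date(df$date)

# Lookup table keyed by month number rather than month name.
season_by_month <- data.frame(
  month  = c(11, 12, 1, 5:7, 2:4, 8:10),
  season = rep(c("Winter", "Summer", "Spring", "Fall"), each = 3),
  stringsAsFactors = FALSE
)

# %m is numeric and therefore independent of the locale.
m <- as.integer(format(df$date, "%m"))
df$season <- season_by_month$season[match(m, season_by_month$month)]
df$season
# [1] "Winter" "Winter" "Summer" "Fall"
```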

Answer 2: (score: 1)

You can do this fairly quickly with lubridate and a function that turns the month number into a season.

library(lubridate)

seasons = function(x){
  if(x %in% 2:4) return("Spring")
  if(x %in% 5:7) return("Summer")
  if(x %in% 8:10) return("Fall")
  if(x %in% c(11,12,1)) return("Winter")

}

dat$Season = sapply(month(dat$date), seasons)

> dat
        date name count Season
1 2016-11-12  Joe     5 Winter
2 2016-11-15  Bob     5 Winter
3 2016-06-15 Nick    12 Summer
4 2016-10-16 Cate     6   Fall