Question

I am using R to reproduce functions like averageif() and maxif(), because my dataset is too large and Excel keeps crashing.

I am looking for a way to find the average wind based on status and between a start date and an end date taken from df. I imagine this would involve the between function in R.

(Note: my actual problem involves billing data, so I need a solution that computes the average per ID between a billing start date and a billing end date.)

Take the storms dataset from dplyr as an example.

Starting with my df:

    status <- c("tropical depression", "tropical depression", "tropical storm")
    Begin_Date <- as.Date(c("1974/06/01", "1980/06/05", "1990/06/07"))
    End_Date <- as.Date(c("1975/06/01", "1981/07/05", "1991/08/07"))
    df <- as.data.frame(cbind(status, Begin_Date, End_Date))
    df$Begin_Date <- as.Date(Begin_Date)
    df$End_Date <- as.Date(End_Date)
    df$status <- as.character(status)
    storms$date <- as.Date(with(storms, paste(year, month, day, sep = "-")), "%Y-%m-%d")

    df

                 status Begin_Date   End_Date
    tropical depression 1974-06-01 1975-06-01
    tropical depression 1980-06-05 1981-07-05
         tropical storm 1990-06-07 1991-08-07

What I would like:

                 status Begin_Date   End_Date Avg Wind
    tropical depression 1974-06-01 1975-06-01     44.3
    tropical depression 1980-06-05 1981-07-05     66.7
         tropical storm 1990-06-07 1991-08-07     56

I am also trying to keep this compatible with dplyr.

My own attempt gave the wrong result.

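The average and maximum wind values in the "what I would like" example are not accurate; they are there for formatting purposes only.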

Answer 1

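OK, I am posting a new answer because you have now specified that you want dplyr. This might be easier if you did not convert to dates and simply built a numeric string instead.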

    library(dplyr)

    x <- storms
    x$date <- as.Date(with(storms, paste(year, month, day, sep = "-")), "%Y-%m-%d")

    # with filter: keep one date window, then average wind per status
    x %>%
      filter(date > as.Date("1975-06-01") & date < as.Date("1976-06-01")) %>%
      group_by(status) %>%
      summarise(Avg.Wind = mean(wind, na.rm = TRUE))

    # with mutate: bin the dates into windows with cut(), then average per window and status
    x %>%
      mutate(times = cut(date, breaks = c(as.Date("1975-06-01"), as.Date("1976-06-01"), as.Date("1978-06-01")))) %>%
      group_by(times, status) %>%
      summarise(Avg.Wind = mean(wind, na.rm = TRUE))
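The numeric alternative to dates mentioned at the top of this answer could look roughly like the sketch below. The column name date_key and the year * 10000 + month * 100 + day encoding are assumptions made for illustration; they are not part of the original answer.

```r
library(dplyr)

# Encode each date as one number, e.g. 27 June 1975 becomes 19750627,
# so date windows can be tested with plain numeric comparisons
x <- storms %>%
  mutate(date_key = year * 10000 + month * 100 + day)

# Same window as the filter example: after 1975-06-01 and before 1976-06-01
res <- x %>%
  filter(date_key > 19750601, date_key < 19760601) %>%
  group_by(status) %>%
  summarise(Avg.Wind = mean(wind, na.rm = TRUE))

res
```

Because the key sorts in the same order as the calendar date, this should select the same rows as the date-based filter without any call to as.Date().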

Answer 2

As noted in the comments: just left_join(storms, your_data) (on status) and filter out the rows whose date falls outside your range.

If you are open to other tools, data.table supports non-equi joins, which will be much more efficient on large data.

    left_join(storms, df, by = "status") %>%
        filter(Begin_Date <= date & date <= End_Date) %>%
        group_by(Begin_Date, End_Date, status) %>%
        summarize(avg_wind = mean(wind))
    # # A tibble: 2 x 4
    # # Groups: Begin_Date, End_Date [?]
    #   Begin_Date End_Date   status              avg_wind
    #   <date>     <date>     <chr>                  <dbl>
    # 1 1980-06-05 1981-07-05 tropical depression     26.9
    # 2 1990-06-07 1991-08-07 tropical storm          45.4

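There are only 2 rows in the result because the storms data apparently contains no tropical depressions between 1974-06-01 and 1975-06-01. In fact, the earliest date in storms is 1975-06-27.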
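The data.table non-equi join mentioned above could be sketched as follows. The object names storms_dt, ranges, and matched are illustrative assumptions, and nomatch = NULL is chosen here so that ranges without any matching storm are dropped, as in the dplyr result.

```r
library(data.table)

# storms ships with dplyr; build the same date column as in the question
storms_dt <- as.data.table(dplyr::storms)
storms_dt[, date := as.Date(paste(year, month, day, sep = "-"), "%Y-%m-%d")]
storms_dt[, status := as.character(status)]

# The lookup table of statuses and date ranges (same values as df in the question)
ranges <- data.table(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07"))
)

# Non-equi join: match on status and on date falling inside [Begin_Date, End_Date]
matched <- storms_dt[
  ranges,
  .(status, Begin_Date = i.Begin_Date, End_Date = i.End_Date, wind = x.wind),
  on = .(status, date >= Begin_Date, date <= End_Date),
  nomatch = NULL
]

# One average per range
res <- matched[, .(avg_wind = mean(wind, na.rm = TRUE)),
               by = .(Begin_Date, End_Date, status)]
res
```

The join only materialises rows that fall inside a range, which is where the efficiency gain over joining on status and filtering afterwards would come from.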

You seem very keen on using between. If you like, you can use it inside filter() in place of my code. It will not change the result.
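A sketch of the between() variant is below. It assumes dplyr 1.1.0 or later, where between() accepts bounds that are vectors of the same length as the data and left_join() has a relationship argument; with older versions the bounds may need to be single values. The names storms2 and ranges are illustrative.

```r
library(dplyr)

# Build the date column and make status a plain character column
storms2 <- storms %>%
  mutate(
    date   = as.Date(paste(year, month, day, sep = "-"), "%Y-%m-%d"),
    status = as.character(status)
  )

# Same lookup table as df in the question
ranges <- data.frame(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07")),
  stringsAsFactors = FALSE
)

# between(date, Begin_Date, End_Date) is Begin_Date <= date & date <= End_Date
res <- left_join(storms2, ranges, by = "status",
                 relationship = "many-to-many") %>%
  filter(between(date, Begin_Date, End_Date)) %>%
  group_by(Begin_Date, End_Date, status) %>%
  summarise(avg_wind = mean(wind), .groups = "drop")

res
```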

Answer 3

This is one of those things that can be done in many ways. Here are some base R options.

    library(dplyr)  # provides the storms dataset

    # Using indexing
    x <- data.frame(storms)
    x$wind <- as.numeric(x$wind)
    mean(x[x$year %in% 1979:1980 & x$status %in% "hurricane", "wind"], na.rm = TRUE)
    max(x[x$year %in% 1979:1980 & x$status %in% "hurricane", "wind"], na.rm = TRUE)

    # Using aggregate
    x$groups <- cut(x$year, c(-Inf, 1979, 1981, 1985, Inf))
    x$groups_type <- paste(x$groups, x$status)
    aggregate(x$wind, by = list(x$groups_type), mean, na.rm = TRUE)
    aggregate(x$wind, by = list(x$groups_type), max, na.rm = TRUE)
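The indexing idea can be stretched to cover one date range per row, which is closer to what the question asks for. This is a minimal sketch in base R (dplyr is loaded only for the storms data); the names ranges and Avg_Wind are assumptions for illustration.

```r
library(dplyr)  # only needed for the storms dataset

x <- as.data.frame(storms)
x$date <- as.Date(paste(x$year, x$month, x$day, sep = "-"), "%Y-%m-%d")
x$status <- as.character(x$status)

# Same lookup table as df in the question
ranges <- data.frame(
  status     = c("tropical depression", "tropical depression", "tropical storm"),
  Begin_Date = as.Date(c("1974-06-01", "1980-06-05", "1990-06-07")),
  End_Date   = as.Date(c("1975-06-01", "1981-07-05", "1991-08-07")),
  stringsAsFactors = FALSE
)

# For each row of ranges, average wind over storms with that status
# whose date lies inside the row's range (an averageif with three conditions)
ranges$Avg_Wind <- sapply(seq_len(nrow(ranges)), function(i) {
  in_range <- x$status == ranges$status[i] &
    x$date >= ranges$Begin_Date[i] &
    x$date <= ranges$End_Date[i]
  mean(x$wind[in_range], na.rm = TRUE)
})

ranges
```

A range with no matching rows yields NaN rather than being dropped, so the output always has one row per range. Swapping mean for max would give the maxif() counterpart, although max() warns and returns -Inf when a range is empty.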

averageif() equivalent in R
