Rearranging data: converting from water year to calendar year

Time: 2016-10-24 03:18:29

Tags: r date time-series tidyr

I have a table containing data from a streamflow gauge:

  Water.Year   May   Jun   Jul   Aug    Sep    Oct    Nov   Dec   Jan   Feb   Mar   Apr 
1  1953-1954 55.55 43.62 30.46 26.17  26.76  41.74  19.92 41.25 28.77 20.96 12.47 10.51
2  1954-1955 23.49 81.35 46.71 29.33  67.83 133.30  37.62 30.16 21.07 19.38 13.87 10.63
3  1955-1956  9.87 51.59 55.36 63.03 154.08  98.15 104.06 32.85 22.89 17.30 15.68 10.88

> data <- structure(list(Water.Year = structure(1:6, .Label = c("1953-1954", "1954-1955", "1955-1956", "1956-1957", "1957-1958", "1958-1959", "1959-1960", "1960-1961", "1961-1962", "1962-1963", "1963-1964", "1964-1965", "1965-1966", "1966-1967", "1967-1968", "1968-1969", "1969-1970", "1970-1971", "1971-1972", "1972-1973", "1973-1974", "1974-1975", "1975-1976", "1976-1977", "1977-1978", "1978-1979", "1979-1980", "1980-1981", "1981-1982", "1982-1983", "1983-1984", "1984-1985", "1985-1986", "1986-1987", "1987-1988", "1988-1989", "1989-1990", "1990-1991", "1991-1992", "1992-1993", "1993-1994", "1994-1995", "1995-1996", "1996-1997", "1997-1998", "1998-1999", "1999-2000", "2000-2001"), class = "factor"), May = c(55.55, 23.49, 9.87, 18.03, 17.46, 11.37), Jun = c(43.62, 81.35, 51.59, 28.61, 15.14, 29.48), Jul = c(30.46, 46.71, 55.36, 24.36, 20.09, 19.48), Ago = c(26.17, 29.33, 63.03, 22.01, 16.97, 16.86), Set = c(26.76, 67.83, 154.08, 28.51, 27.24, 21.01), Oct = c(41.74, 133.3, 98.15, 53.72, 35.78, 19.78), Nov = c(19.92, 37.62, 104.06, 115.78, 20.35, 18.69), Dic = c(41.25, 30.16, 32.85, 32.04, 22, 18.86), Ene = c(28.77, 21.07, 22.89, 25.44, 13.27, 14.89), Feb = c(20.96, 19.38, 17.3, 14.53, 10.37, 10.4), Mar = c(12.47, 13.87, 15.68, 10.78, 8.77, 8.79), Abr = c(10.51, 10.63, 10.88, 9.33, 7.69, 8.99)), .Names = c("Water.Year", "May", "Jun", "Jul", "Ago", "Set", "Oct", "Nov", "Dic", "Ene", "Feb", "Mar", "Abr"), row.names = c(NA, 6L), class = "data.frame")

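The data is arranged by "water year": each year starts in May and ends in April of the following year (you can see this in the first column). I would like to convert it into a data frame with three columns: Calendar.Year, Month, Flow.Measurement

I have already split the Water.Year column into two columns using separate() from tidyr: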

> df = separate(data, Water.Year, c("year1","year2"))

   year1 year2   May   Jun   Jul   Aug    Sep    Oct    Nov   Dec   Jan   Feb   Mar   Apr 
 1  1953  1954 55.55 43.62 30.46 26.17  26.76  41.74  19.92 41.25 28.77 20.96 12.47 10.51
 2  1954  1955 23.49 81.35 46.71 29.33  67.83 133.30  37.62 30.16 21.07 19.38 13.87 10.63

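Now I intend to use gather() from tidyr to do the rest of the conversion, but I am stuck on how to create the Calendar.Year column, using year1 for the months May through Dec and year2 for Jan through Apr.

Any help would be greatly appreciated.

4 answers:

Answer 0 (score: 3)

Another idea (using @useR's data, which has English month names)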

library(dplyr)
library(tidyr)


df %>%
  separate(Water.Year, c("Year1", "Year2")) %>%
  gather(Month, Value, -(Year1:Year2)) %>%
  group_by(Year1, Year2) %>%
  mutate(Year = if_else(match(Month, month.abb) >= 5, Year1, Year2),
         Month = factor(Month, levels = month.abb)) %>%
  ungroup() %>%
  select(Year, Month, Value) %>%
  arrange(Year, Month)

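We split the Water.Year column into Year1 and Year2, then use gather() to reshape the data into long format. Then, for each group, we use match() against month.abb to check whether the month number is greater than or equal to 5 (May), and assign the corresponding year with if_else(). Finally, we drop the columns that are no longer needed and arrange() by Year and Month, which gives:
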
## A tibble: 36 × 3
#    Year  Month Value
#   <chr> <fctr> <dbl>
#1   1953    May 55.55
#2   1953    Jun 43.62
#3   1953    Jul 30.46
#4   1953    Aug 26.17
#5   1953    Sep 26.76
#6   1953    Oct 41.74
#7   1953    Nov 19.92
#8   1953    Dec 41.25
#9   1954    Jan 28.77
#10  1954    Feb 20.96
## ... with 26 more rows
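
The year-assignment rule described above can be tried on its own, outside the pipeline. This is only a minimal sketch in base R: `calendar_year()` is a hypothetical helper name, not something defined in the answer, and it assumes English month abbreviations as in month.abb.

```r
# Hypothetical helper (not from the answer): pick the calendar year for a month.
# Months May..Dec (position >= 5 in month.abb) belong to the first year of the
# water year, Jan..Apr to the second.
calendar_year <- function(month, year1, year2) {
  ifelse(match(month, month.abb) >= 5, year1, year2)
}

calendar_year(c("May", "Dec", "Jan", "Apr"), "1953", "1954")
# "1953" "1953" "1954" "1954"
```

Note that the years stay character strings here, just as the Year column is <chr> in the output above; wrapping the result in as.integer() would be one way to make it numeric.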

Answer 1 (score: 1)

OK, how about this. It is a mashup of reshape2 and base R.

I used your data set once you posted it. Thanks for providing it.

data <- structure(list(Water.Year = structure(1:6, .Label = c("1953-1954", "1954-1955", "1955-1956", "1956-1957", "1957-1958", "1958-1959", "1959-1960", "1960-1961", "1961-1962", "1962-1963", "1963-1964", "1964-1965", "1965-1966", "1966-1967", "1967-1968", "1968-1969", "1969-1970", "1970-1971", "1971-1972", "1972-1973", "1973-1974", "1974-1975", "1975-1976", "1976-1977", "1977-1978", "1978-1979", "1979-1980", "1980-1981", "1981-1982", "1982-1983", "1983-1984", "1984-1985", "1985-1986", "1986-1987", "1987-1988", "1988-1989", "1989-1990", "1990-1991", "1991-1992", "1992-1993", "1993-1994", "1994-1995", "1995-1996", "1996-1997", "1997-1998", "1998-1999", "1999-2000", "2000-2001"), class = "factor"), May = c(55.55, 23.49, 9.87, 18.03, 17.46, 11.37), Jun = c(43.62, 81.35, 51.59, 28.61, 15.14, 29.48), Jul = c(30.46, 46.71, 55.36, 24.36, 20.09, 19.48), Ago = c(26.17, 29.33, 63.03, 22.01, 16.97, 16.86), Set = c(26.76, 67.83, 154.08, 28.51, 27.24, 21.01), Oct = c(41.74, 133.3, 98.15, 53.72, 35.78, 19.78), Nov = c(19.92, 37.62, 104.06, 115.78, 20.35, 18.69), Dic = c(41.25, 30.16, 32.85, 32.04, 22, 18.86), Ene = c(28.77, 21.07, 22.89, 25.44, 13.27, 14.89), Feb = c(20.96, 19.38, 17.3, 14.53, 10.37, 10.4), Mar = c(12.47, 13.87, 15.68, 10.78, 8.77, 8.79), Abr = c(10.51, 10.63, 10.88, 9.33, 7.69, 8.99)), .Names = c("Water.Year", "May", "Jun", "Jul", "Ago", "Set", "Oct", "Nov", "Dic", "Ene", "Feb", "Mar", "Abr"), row.names = c(NA, 6L), class = "data.frame")

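I decided to use the year information you had already split out and add the calendar year on top of it, since we know that May through December belong to year1 and January through April belong to year2. It may be a bit convoluted, but it gets the job done.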

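library(tidyr)    # separate()
library(reshape2) # melt()

df = separate(data, Water.Year, c("year1","year2"))

# Name the id columns explicitly so melt() does not have to guess them
fixDF<-melt(df, id.vars = c("year1", "year2"))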


fixDF$CalendarYear<-rep(NA,nrow(fixDF))

fixDF$CalendarYear[min(which(fixDF$variable=="May")):max(which(fixDF$variable=="Dic"))]<-df$year1

fixDF$CalendarYear[min(which(fixDF$variable=="Ene")):max(which(fixDF$variable=="Abr"))]<-df$year2

fixDF<-fixDF[,3:5]

colnames(fixDF)<-c("Month","Flow.Measurement", "Calendar.Year")
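
The code above relies on the Spanish month abbreviations in the posted data (Ago, Set, Dic, Ene, Abr). If English names are preferred, for example so that month.abb can be used, the columns could be renamed first. A minimal sketch, assuming a hypothetical lookup vector `es_to_en` that covers only the five abbreviations that differ in this data set:

```r
# Hypothetical lookup (an assumption, not part of the answer): Spanish -> English
es_to_en <- c(Ago = "Aug", Set = "Sep", Dic = "Dec", Ene = "Jan", Abr = "Apr")

# Column names as they appear in the posted data
cols <- c("Water.Year", "May", "Jun", "Jul", "Ago", "Set", "Oct", "Nov",
          "Dic", "Ene", "Feb", "Mar", "Abr")

# Replace only the names that have an entry in the lookup
needs_fix <- cols %in% names(es_to_en)
cols[needs_fix] <- unname(es_to_en[cols[needs_fix]])
cols
```

With the real data frame the same replacement would be applied to names(data) instead of a standalone vector.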

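Answer 2 (score: 1)

OK, I just realized that the months in the structure() you provided may be in a different language. I will stick with the data I created, which uses English month names, so that people can see the corresponding solution in English.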

library(tidyr) # for separate function
library(reshape2) # for melt function
library(dplyr) # for pipe operator and arrange function

# Reproducible Data
weather = structure(list(Water.Year = structure(1:3, .Label = c("1953-1954", 
                                                      "1954-1955", "1955-1956"), class = "factor"), 
                         May = c(55.55, 23.49, 9.87), 
                         Jun = c(43.62, 81.35, 51.59), 
                         Jul = c(30.46, 46.71, 55.36), 
                         Aug = c(26.17, 29.33, 63.03), 
                         Sep = c(26.76, 67.83, 154.08), 
                         Oct = c(41.74, 133.3, 98.15), 
                         Nov = c(19.92, 37.62, 104.06), 
                         Dec = c(41.25, 30.16, 32.85), 
                         Jan = c(28.77, 21.07, 22.89), 
               Feb = c(20.96, 19.38, 17.3), Mar = c(12.47, 13.87, 15.68), 
               Apr = c(10.51, 10.63, 10.88)), .Names = c("Water.Year", "May", 
                                                         "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb", 
                                                         "Mar", "Apr"), class = "data.frame", row.names = c(NA, -3L))

The code starts here:

df = separate(weather, Water.Year, c("year1","year2"))

# Split into two datasets: May-Dec belong to year1, Jan-Apr to year2
df1 = subset(df, select = c(year1, May:Dec))
df2 = subset(df, select = c(year2, Jan:Apr))

longdf1 = melt(df1, variable.name = "Month", id.vars = "year1",
               value.name = "Flow.Measurement") 
names(longdf1)[1] = "Calendar.Year"
longdf2 = melt(df2, variable.name = "Month", id.vars = "year2",
               value.name = "Flow.Measurement") 
names(longdf2)[1] = "Calendar.Year"

# Combine the two datasets
final_df = rbind(longdf1, longdf2)

# Releveling the Month
final_df$Month = factor(final_df$Month, levels = month.abb)

final_df = arrange(final_df, Calendar.Year, Month)

The final data frame:

> final_df
   Calendar.Year Month Flow.Measurement
1           1953   May            55.55
2           1953   Jun            43.62
3           1953   Jul            30.46
4           1953   Aug            26.17
5           1953   Sep            26.76
6           1953   Oct            41.74
7           1953   Nov            19.92
8           1953   Dec            41.25
9           1954   Jan            28.77
10          1954   Feb            20.96
11          1954   Mar            12.47
12          1954   Apr            10.51
13          1954   May            23.49
14          1954   Jun            81.35
15          1954   Jul            46.71
16          1954   Aug            29.33
17          1954   Sep            67.83
18          1954   Oct           133.30
19          1954   Nov            37.62
20          1954   Dec            30.16
21          1955   Jan            21.07
22          1955   Feb            19.38
23          1955   Mar            13.87
24          1955   Apr            10.63
25          1955   May             9.87
26          1955   Jun            51.59
27          1955   Jul            55.36
28          1955   Aug            63.03
29          1955   Sep           154.08
30          1955   Oct            98.15
31          1955   Nov           104.06
32          1955   Dec            32.85
33          1956   Jan            22.89
34          1956   Feb            17.30
35          1956   Mar            15.68
36          1956   Apr            10.88

Answer 3 (score: 0)

I decided to use parts of all the answers I received. Here is the code I wrote:

library(dplyr)
library(tidyr)

#separate the year column into two years
df_years <- df %>%
  separate(Water.Year, c("Year1", "Year2")) 

#create two different dataframes for each section of the year
df1 <- subset(df_years, select = c(Year1, May:Dec))
df2 <- subset(df_years, select = c(Year2, Jan:Apr))

#rename both years' columns using the same name
colnames(df2)[1] <- "Year"
colnames(df1)[1] <- "Year"

#join both dataframes
cleandata <- full_join(df1, df2, by = "Year")

#sort months chronologically
cleandata <- cleandata[, c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")]

#convert to tidy data set
cleandata <- gather(cleandata, "Month", "Flow", 2:13)

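#sort by year and month (Month as a factor, so it sorts in calendar order rather than alphabetically)
cleandata$Month <- factor(cleandata$Month, levels = month.abb)
cleandata <- arrange(cleandata, Year, Month)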
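
How the months end up ordered depends on the type of the Month column: character values sort alphabetically, while a factor whose levels are month.abb sorts in calendar order. A small base R sketch of the difference (the `months` vector is made up for illustration):

```r
# Illustrative values only
months <- c("May", "Jan", "Dec", "Apr")

# Character vector: alphabetical order
sort(months)
# "Apr" "Dec" "Jan" "May"

# Factor with calendar-ordered levels: chronological order
months_fct <- factor(months, levels = month.abb)
as.character(sort(months_fct))
# "Jan" "Apr" "May" "Dec"
```

arrange() from dplyr follows the same rule, ordering a factor column by its levels.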