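Below is a sample of my data. I am trying to build the data for a data table in which, after using the dcast function, the columns must appear in a very specific order. I am also trying to calculate the differences between certain columns. The goal is to get the columns in the order State, Region, 1_2017, 1_2018, 1_diff, 2_2017, 2_2018, 2_diff, and so on.
I tried calculating the differences and ordering the columns by calling each column explicitly, but that seems like a very poor approach, especially since my actual data has more than 50 columns. Below are my sample data and the logic I have been using.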
library(reshape2)
library(dplyr)
#Data
data<-data.frame("State"=c("AK","AK","AK","AK","AK","AK","AK","AK","AR","AR","AR","AR","AR","AR","AR","AR"),
"StoreRank" = c(1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2),
"Year" = c(2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018),
"Region" = c("East","East","West","West","East","East","West","West","East","East","West","West","East","East","West","West"),
"Store" = c("Ingles","Ingles","Ingles","Ingles","Safeway","Safeway","Safeway","Safeway","Albertsons","Albertsons","Albertsons","Albertsons","Safeway","Safeway","Safeway","Safeway"),
"Total" = c(500000,520000,480000,485000,600000,600000,500000,515000,500100,520100,480100,485100,601010,601000,501000,515100))
#Formatting data for Data table
data<-dcast(data, State+Region~StoreRank+Year, value.var = 'Total')
#Function to calculate difference between columns
diff_calculation <- function(data) {
  mutate(data,
         `1_diff` = data$`1_2018` - data$`1_2017`,
         `2_diff` = data$`2_2018` - data$`2_2017`)
}
#Applying difference calculation function
reform.data<-diff_calculation(data)
#Changes the column names from numbers to letter to try and order columns
names(reform.data)<-gsub(x = colnames(reform.data), pattern="1_", replacement = "a_")
names(reform.data)<-gsub(x = colnames(reform.data), pattern="2_", replacement = "b_")
#Trying to order columns as State, Region, 1_2017, 1_2018, 1_diff, 2_2017, 2_2018, 2_diff, etc.
ordered.data<-reform.data[,order(names(reform.data))]
final.data <- ordered.data %>%
  select('State', 'Region', 'a_2017', 'a_2018', 'a_diff', 'b_2017', 'b_2018', 'b_diff')
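I am looking for a better way to calculate the differences between columns and to order the columns after applying the dcast function to data with a large number of columns.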
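For illustration, the manual steps above could be generalized roughly as follows. This is only a sketch under stated assumptions: there are exactly two years (2017 and 2018) per rank, and the names `raw`, `wide`, `ranks`, `suffixes`, and `value_cols` are made up for this example. It loops over the ranks to compute each difference, then builds the interleaved column order with `outer()` instead of typing every name.

```r
library(reshape2)

# Same sample data as above, kept under a separate (hypothetical) name
raw <- data.frame(
  State     = rep(c("AK", "AR"), each = 8),
  StoreRank = rep(rep(c(1, 2), each = 4), times = 2),
  Year      = rep(c(2017, 2018), times = 8),
  Region    = rep(rep(c("East", "West"), each = 2), times = 4),
  Total     = c(500000, 520000, 480000, 485000, 600000, 600000, 500000, 515000,
                500100, 520100, 480100, 485100, 601010, 601000, 501000, 515100)
)

wide <- dcast(raw, State + Region ~ StoreRank + Year, value.var = "Total")

# One diff column per rank; assumes both a 2017 and a 2018 column exist
ranks <- sort(unique(raw$StoreRank))
for (r in ranks) {
  wide[[paste0(r, "_diff")]] <- wide[[paste0(r, "_2018")]] - wide[[paste0(r, "_2017")]]
}

# Build "1_2017", "1_2018", "1_diff", "2_2017", ... in that order:
# outer() gives a rank-by-suffix matrix; transposing makes as.vector() read it rank by rank
suffixes   <- c("2017", "2018", "diff")
value_cols <- as.vector(t(outer(ranks, suffixes, paste, sep = "_")))

wide <- wide[, c("State", "Region", value_cols)]
```

Because the ranks stay numeric until the names are pasted together, a rank of 10 would sort after 9 rather than after 1, which avoids the renaming to letters.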
Answer 0 (score: 0)
One approach is to handle this in long format, for example with the tidyverse:
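library(tidyverse)
# `data` is the original long data frame from the question, before dcast() overwrites it
long_format <- data %>%
  mutate(
    StoreRank = ifelse(StoreRank == 1, "a", "b"),
    diff_col = paste(StoreRank, "diff", sep = "_"),
    Year = paste(StoreRank, Year, sep = "_")
  ) %>%
  group_by(State, Region, StoreRank) %>%
  # 2018 minus 2017 within each group; fill() copies the value up onto the 2017 row
  mutate(diff = Total - lag(Total)) %>%
  fill(diff, .direction = "up") %>%
  ungroup()
# Stack the yearly totals and the diffs, then spread them back to wide format
final_df <- bind_rows(
  long_format %>% select(State, Region, Year, Total),
  long_format %>% select(State, Region, Year = diff_col, Total = diff)
) %>%
  arrange(Year) %>%
  rowid_to_column() %>%
  spread(Year, Total) %>%
  group_by(State, Region) %>%
  # keep the first non-missing value of each column per State/Region
  # (a formula lambda replaces the deprecated funs())
  summarise_all(~ first(na.omit(.))) %>%
  select(-rowid)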
Output:
# A tibble: 4 x 8
# Groups:   State [2]
  State Region a_2017 a_2018 a_diff b_2017 b_2018 b_diff
  <fct> <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AK    East   500000 520000  20000 600000 600000      0
2 AK    West   480000 485000   5000 500000 515000  15000
3 AR    East   500100 520100  20000 601010 601000    -10
4 AR    West   480100 485100   5000 501000 515100  14100
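The same long-format idea might be written without hard-coding the "a"/"b" labels, which could matter with many store ranks. The following is a minimal sketch and not a drop-in replacement: it assumes dplyr 1.0 or later and tidyr 1.0 or later (for `.groups` and `pivot_wider()`), assumes exactly one 2017 row and one 2018 row per State/Region/StoreRank, and uses the made-up names `raw`, `diffs`, and `wide`. It relies on `pivot_wider()` creating columns in order of first appearance, so sorting the long data first fixes the column order.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, under a hypothetical name
raw <- data.frame(
  State     = rep(c("AK", "AR"), each = 8),
  StoreRank = rep(rep(c(1, 2), each = 4), times = 2),
  Year      = rep(c(2017, 2018), times = 8),
  Region    = rep(rep(c("East", "West"), each = 2), times = 4),
  Total     = c(500000, 520000, 480000, 485000, 600000, 600000, 500000, 515000,
                500100, 520100, 480100, 485100, 601010, 601000, 501000, 515100)
)

# One "diff" row per State/Region/StoreRank (2018 minus 2017)
diffs <- raw %>%
  group_by(State, Region, StoreRank) %>%
  summarise(Total = Total[Year == 2018] - Total[Year == 2017], .groups = "drop") %>%
  mutate(Year = "diff")

wide <- raw %>%
  mutate(Year = as.character(Year)) %>%
  bind_rows(diffs) %>%
  # numeric rank first, then "2017" < "2018" < "diff" within each rank
  arrange(StoreRank, Year) %>%
  pivot_wider(id_cols = c(State, Region),
              names_from = c(StoreRank, Year),
              values_from = Total)

names(wide)
```

Here the columns keep their numeric prefixes (`1_2017`, `1_2018`, `1_diff`, ...), as asked for in the question, and `StoreRank` is sorted as a number, so rank 10 would follow rank 9.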