How to summarize by quarter in R

Time: 2018-06-22 14:59:49

Tags: r

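I am having trouble summarizing data from a database in R. I want to pull the data and aggregate it by quarter.

Below is the code I use to read the txt files and write the output, but it fails with an error.

What do I need to change so that the code runs and the data is summarized by quarter?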

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"

# List names of sub-folders (currently grouped by first two characters of
# CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
                  "MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
                  "VA-WA", "WB-ZZ")

# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))

# Enable or disable filtering of raw files to only grab items bought within
# certain months to save space.
# If false, all files will be scanned for unique items, which will take
# longer and be a larger file.
date_filter <- TRUE


##########
## CODE ##
##########

starttime <- Sys.time()
mastertable <- NULL

for (j in seq_along(in_subfolders)) {
  subfolder <- in_subfolders[[j]]
  sub_directory <- paste0(in_directory, subfolder, "/")

  ## IMPORT DATA
  in_filenames <- dir(sub_directory, pattern = "\\.txt$")

  for (i in seq_along(in_filenames)) {

    # Default value provided for when fast filtering is disabled.
    read_this_file <- TRUE

    # To fast filter the data, we choose to include or exclude an entire file
    # based on the date of its first line.
    # WARNING: This is only a valid method if filtering by entire months,
    # since that is the amount of data housed in each file.
    if (date_filter) {
      temptable <- fread(paste0(sub_directory, in_filenames[i]),
                         colClasses = c(CUSTOMER_TIER = "character"),
                         na.strings = "", nrows = 1)
      temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

      # If date matches, set read flag to TRUE.  If date does not match, set
      # read flag to FALSE.
      read_this_file <- temptable[, INVOICE_DT] %within% date_range
    }


    if (read_this_file) {
      print(Sys.time() - starttime)
      print(paste0("Reading in ", in_filenames[i]))
      temptable <- fread(paste0(sub_directory, in_filenames[i]),
                         colClasses = c(CUSTOMER_TIER = "character"),
                         na.strings = "")
      temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
                             .SDcols = c("INV_ITEM_ID", "Ext Sale", "Ext Total Cost",
                                         "CE100", "CE110", "CE120",
                                         "QTY_SOLD", "PACKSLIP_WHSL")]

      # Combine into full list
      mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
      # Release unneeded memory
      rm(temptable)
    }
  }
}

# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)

print(Sys.time()-starttime)

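After running this script, this is the error message I get.

Error in gsum(INV_ITEM_ID) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)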
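
The message indicates that INV_ITEM_ID is a character column, so it cannot be passed to sum(). As a minimal sketch of one way around it, the snippet below keeps only the requested columns that are numeric before aggregating. The toy table and the helper names wanted and num_cols are hypothetical stand-ins, not part of the original script, and whether dropping the ID column is acceptable depends on what the output is meant to contain.

```r
library(data.table)
library(lubridate)

# Hypothetical toy data standing in for one raw transaction file
dt <- data.table(
  INVOICE_DT  = as.Date(c("2018-01-15", "2018-02-20", "2018-04-03")),
  INV_ITEM_ID = c("A1", "B2", "A1"),  # character: cannot be summed
  QTY_SOLD    = c(2, 5, 1)
)

# Columns we would like to aggregate (assumed list for illustration)
wanted <- c("INV_ITEM_ID", "QTY_SOLD")

# Keep only those that are actually numeric
is_num   <- vapply(dt[, wanted, with = FALSE], is.numeric, logical(1))
num_cols <- wanted[is_num]

# Sum the numeric columns per quarter
by_quarter <- dt[, lapply(.SD, sum),
                 by = .(quarter = quarter(INVOICE_DT)),
                 .SDcols = num_cols]
print(by_quarter)
```

If the item ID is needed in the result, it would have to go into by rather than .SDcols, which gives one row per item and quarter instead of one row per quarter.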

1 Answer:

Answer 0: (score: 0)

Here is a general approach using some generic data.

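library(tidyverse)
library(lubridate)

# 30 dates spaced 100 days apart, each with a random value
data.frame(date = seq(as.Date("2010-01-12"), as.Date("2018-02-03"), by = 100),
           var = runif(30)) %>%
  # Group by year and quarter combined, e.g. 2010.1
  group_by(quarter = quarter(date, with_year = TRUE)) %>%
  summarize(average_var = mean(var))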

If you don't care about distinguishing between years, you can omit the with_year argument.
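
The same idea can be sketched with data.table, which is what the question uses. Because each raw file there holds one month of data, summarizing every file by quarter and stacking the results leaves several rows per quarter, so a final aggregation over the combined table is likely needed. The tables file_jan, file_feb and file_apr and the helper summarise_file are hypothetical and only stand in for the fread() results.

```r
library(data.table)
library(lubridate)

# Hypothetical per-file tables; in the question each would come from fread()
file_jan <- data.table(INVOICE_DT = as.Date(c("2018-01-05", "2018-01-20")),
                       QTY_SOLD   = c(3, 4))
file_feb <- data.table(INVOICE_DT = as.Date("2018-02-11"), QTY_SOLD = 10)
file_apr <- data.table(INVOICE_DT = as.Date("2018-04-02"), QTY_SOLD = 1)

# Sum the given numeric columns per year and quarter for one file
summarise_file <- function(dt, cols) {
  dt[, lapply(.SD, sum, na.rm = TRUE),
     by = .(year = year(INVOICE_DT), quarter = quarter(INVOICE_DT)),
     .SDcols = cols]
}

per_file <- lapply(list(file_jan, file_feb, file_apr),
                   summarise_file, cols = "QTY_SOLD")
combined <- rbindlist(per_file, use.names = TRUE)

# Monthly files in the same quarter each contribute a row, so sum once more
quarterly <- combined[, lapply(.SD, sum), by = .(year, quarter)]
print(quarterly)
```

Grouping by both year and quarter plays the same role as with_year above: it keeps Q1 of different years apart if the date range ever spans more than one year.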