Question

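I am reading in a large number of files that contain monthly price information for each product.

I want to obtain a single data table that combines all of these files.

The key of this table will be two columns holding the product identifier and the date.

A third column then contains the retail price.

In the source files, each price column has a name of the format RETAILPRICE_[dd.mm.yyyy].

To keep the final data table from ending up with a large number of columns, I need to rename that column to a generic retail price name and create a new column containing the date.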

The following code runs into an error, with the message shown after the code, because data.table does not understand the external reference to one of its columns.

library(data.table)
library(readxl)

# this is how I obtain the list of files that have to be read in
# list the files
# files <- list.files(path = "path",
#                    pattern = "^Publications.*$",
#                    full.names = TRUE)

# the data looks like this, although it is contained in an Excel file.
# sample data
ProdID <- c(836187, 2398159, 2398165, 2398171, 2398188, 1800180, 2320105, 2320128, 2320140, 2320163, 1714888, 2516340)
RETAILPRICE_01.01.2003 <- c(12.50, 43.50, 65.50, 45.60, 69.45, 21.30, 81.15, 210.70, 405.00, 793.60, 116.50, 162.60)
Publications_per_2003.01.01 <- data.table(ProdID, RETAILPRICE_01.01.2003)

# uncomment if you want to write this to excel
# using .xls on purpose, because that's what they used back in the days
# xlsx::write.xlsx(Publications_per_2003.01.01,
#    "Publications_per_2003.01.01.xls",
#    row.names = FALSE)

# create data table
price_list <- data.table(
                 prodID = character(),
                 date = character(),
                 retail_price = numeric())


price_list <- lapply(files, function(x){

  # obtain date from file name
  # date in file name has the structure yyyy.mm.dd
  # while in the column name date has the structure
  # dd.mm.yyyy
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  # obtain day, month and year separately
  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # store the name of the column containing the retail price
  priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

  # read the xls file with the price info and in one go
  # keep only the relevant columns
  file <- data.table(read_excel(x))[
    ,.(prodID= as.character(ProdID),
       retail_price = priceVar,
       date = as.character(gsub("\\.","-",date)))#,with = F
    ]

  # merge the new file with the existing data table
  price_list <- merge(price_list,file,"ProdID")
})

The error message is

Error in rep(x[[i]], length.out = mn) : 
  attempt to replicate an object of type 'symbol'

If I comment out this part

retail_price = priceVar,

there is no error.

So the problem lies in the reference to the column, which does not work.

I have also tried

priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

file <- data.table(read_excel(x))

setnames(file, priceVar, "retail_price")

but this gives an error as well (column names modified to fit the example).

If someone could enlighten me, I would be eternally grateful.
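For reference, the column lookup at the heart of the question can be tried in isolation. In the question's code, priceVar holds a symbol, and inside .() that symbol is passed along as a value instead of being evaluated as a column. The sketch below keeps the name as a character string and resolves it with get() while j is evaluated. It assumes a small in-memory table (dt, built from the first three rows of the sample data) rather than a file read from Excel, and the fixed date string is only a stand-in for the value parsed from the file name.

```r
library(data.table)

# Toy table standing in for one source file (first rows of the sample data)
dt <- data.table(
  ProdID = c(836187, 2398159, 2398165),
  RETAILPRICE_01.01.2003 = c(12.50, 43.50, 65.50)
)

# Keep the column name as a plain character string, not a symbol
priceVar <- "RETAILPRICE_01.01.2003"

# get() resolves the string to the column of that name inside j
res <- dt[, .(prodID = as.character(ProdID),
              retail_price = get(priceVar),
              date = "2003-01-01")]
print(res)
```

Whether get() or a rename by string reads better is a matter of taste; both avoid handing a bare symbol to data.table.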

Answer 1

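It would help if you provided a sample of the data you are working with, so that we can try the code against it. I also read through your code, and on this line:

price_list <- merge(prijslijst,file,"ProdID")

you never define the variable "prijslijst", so the problem may lie there.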

Answer 2

In this case, it is much easier to work with plain data frames than with data.table.

price_list <- lapply(files, function(x){
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # make it a character, not a name
  priceVar <- paste0("RETAILPRICE_",day,".",month,".",year)

  one_df <- readxl::read_excel(x)[, c("ProdID", priceVar)]
  colnames(one_df) <- c("prodID", "retail_price")
  one_df$prodID = as.character(one_df$prodID) # NB: as.integer would be much more efficient, but be careful for values above 2.0e9
  one_df$date = as.character(gsub("\\.","-",date))

  one_df
})

# Watch out: this will pile up the records from all files
# In your initial code you were using merge(...) which computes the intersection
price_list <- do.call(rbind, price_list)

# Optional:
data.table::setDT(price_list)
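If staying within data.table is preferred, the same idea might be sketched as below: rename the price column by passing its name as a character string to setnames(), add the date column, and stack the pieces with rbindlist(). The list tables is a hypothetical stand-in for files that have already been read in (in practice each element would come from readxl::read_excel()), and the February prices are made up purely for illustration.

```r
library(data.table)

# Hypothetical stand-ins for two files already read in, named by their date
# (the February values are invented for this example)
tables <- list(
  "2003.01.01" = data.table(ProdID = c(836187, 2398159),
                            RETAILPRICE_01.01.2003 = c(12.50, 43.50)),
  "2003.02.01" = data.table(ProdID = c(836187, 2398159),
                            RETAILPRICE_01.02.2003 = c(12.75, 44.00))
)

price_list <- rbindlist(lapply(names(tables), function(d) {
  dt <- copy(tables[[d]])  # copy so the input tables are not modified by reference

  # rebuild the dd.mm.yyyy column name from the yyyy.mm.dd date
  priceVar <- paste0("RETAILPRICE_",
                     substr(d, 9, 10), ".", substr(d, 6, 7), ".", substr(d, 1, 4))

  # setnames() expects the old names as character strings, not symbols
  setnames(dt, c("ProdID", priceVar), c("prodID", "retail_price"))
  dt[, prodID := as.character(prodID)]
  dt[, date := gsub("\\.", "-", d)]
  dt
}))

# product identifier and date together form the key
setkey(price_list, prodID, date)
print(price_list)
```

As with rbind, this stacks the rows of every file; a product missing from one month simply has no row for that date.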

R data.table: referencing a data table column using an externally assigned column name

2 answers: