Question

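I am trying to combine hundreds of Excel files, each with many rows and columns, into one large data set. I have managed to write code that merges the files into one, but I am struggling to clean the data. In particular, I have the following two obstacles: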

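Many (but not all) of the files have duplicated columns. Currently R keeps all of them and appends X.1, X.2, etc. to the repeated names. How can I stop R from reading a duplicated column and have it move on to the next one?
I stopped R from putting a dot into the column names by setting check.names = F, and then replaced the spaces with _. Now the tricky part: I want to split each column into 4 columns. For example, X_1%_2/2/2012:P is one of my column names, and it holds data (NA, blanks and other values). How can I create 4 columns so that X, 1%, 2/2/2012 and P become 4 separate columns, with the original data under the P column and the other 3 columns left blank?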
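For the first point, here is a minimal sketch of one possible approach. It does not stop the reader from loading the repeated columns; it drops them after reading. It assumes the duplicates keep identical names (as they do with check.names = FALSE) and that keeping the first occurrence is acceptable. The toy data frame and the name `deduped` are made up for illustration.

```r
# Toy data frame with a repeated column name (hypothetical data);
# check.names = FALSE keeps the duplicate name instead of renaming it
df <- data.frame(a = 1:3, b = 4:6, a = 7:9, check.names = FALSE)

# duplicated() flags the second and later occurrences of each name,
# so negating it keeps only the first column for every name
deduped <- df[, !duplicated(names(df)), drop = FALSE]

names(deduped)
```

If the names have already been changed to X.1, X.2 and so on, this test on the names no longer finds them, so the filter would have to run before any renaming.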

Below is my code for reference (it is not the most efficient code, but it is my first R code, and I still have a long way to go...)

# I am using the XLConnect package so far
library(XLConnect)

# list.files() expects a regular expression, not a shell glob
data.files = list.files(pattern = "\\.xls")
df = readWorksheetFromFile(file = data.files[1], sheet = 1, check.names = FALSE) # Read the first file

# Loop through the remaining files and merge them into the existing data frame
for (file in data.files[-1]) {
  newFile = readWorksheetFromFile(file = file, sheet = 1, check.names = FALSE)
  df = merge(df, newFile, all = TRUE, check.names = FALSE)
}

write.csv(df, file = "read_1.csv")

# Deleting some unwanted columns (agrep does approximate matching on the names)
df = df[, -agrep("Mty", colnames(df))]
df = df[, -agrep("Dur", colnames(df))]

# Replacing every space in the column names with an underscore
names(df) <- gsub(" ", "_", names(df))

write.csv(df, file = "read_1.csv") # writing output as CSV

Below is the output of dput(head(df)) (I removed a lot of the data to keep the output short):

structure(list(Date = structure(c(1373259600, 1373346000, 1373432400, 
1373518800, 1373605200, 1373864400), class = c("POSIXct", "POSIXt"
), tzone = ""), `AA_5.55_02/01/2017:Price` = c(105.57574, 105.63598, 
105.70395, 106.62471, 106.49467, 106.62642), `AA_6.5_06/15/2018:Price` = c(106.75947, 
106.84083, 107.248726, 108.383835, 108.39564, 108.3026), `AA_5.72_02/23/2019:Price` = c(101.00432, 
101.09463, 101.67893, 102.75101, 103.0618, 103.267204), `AAL_6.125_06/01/2018:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_5.25_01/31/2021:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_4.95_01/15/2023:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_04/03/2017:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_09/27/2017:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_9.375_04/08/2019:Price` = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Date", 
"AA_5.55_02/01/2017:Price", "AA_6.5_06/15/2018:Price", "AA_5.72_02/23/2019:Price", 
"AAL_6.125_06/01/2018:Price", "AAL_5.25_01/31/2021:Price", "AAL_4.95_01/15/2023:Price", 
"AALLN_2.625_04/03/2017:Price", "AALLN_2.625_09/27/2017:Price", 
"AALLN_9.375_04/08/2019:Price"), row.names = c(NA, 6L), class = "data.frame")
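To make the second point concrete, here is a rough sketch of splitting names shaped like the ones in the dput output into four pieces. It builds a separate lookup table rather than the four-column layout described above, so it is only one possible reading of the goal. The labels `ticker`, `coupon`, `maturity` and `field` are assumed names for the four parts and do not come from the data; the month/day/year date order is inferred from values such as 06/15/2018.

```r
# Three column names copied from the dput output above
col_names <- c("AA_5.55_02/01/2017:Price",
               "AAL_6.125_06/01/2018:Price",
               "AALLN_9.375_04/08/2019:Price")

# Split every name on "_" or ":"; each name yields exactly four pieces,
# so the list can be bound row-wise into a character matrix
parts <- do.call(rbind, strsplit(col_names, "[_:]"))

# One row per original column; the part labels are hypothetical
name_parts <- data.frame(
  column   = col_names,
  ticker   = parts[, 1],
  coupon   = as.numeric(parts[, 2]),
  maturity = as.Date(parts[, 3], format = "%m/%d/%Y"),
  field    = parts[, 4],
  stringsAsFactors = FALSE
)

name_parts
```

This relies on every name containing exactly two underscores and one colon; a name with a different number of separators would make the row binding misalign and should be checked for first.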

Manipulating data using R

0 answers: