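I am trying to collate hundreds of (Excel) files, each with many rows and columns, into one large dataset. I was able to write code that combines the files into one, but I am having a hard time cleaning the data. In particular, I have the following obstacles: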
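Many (not all) of the files have duplicate columns. Currently R keeps all of them and appends X.1, X.2, etc. to the duplicated names. How do I stop R from reading a duplicate column and move on to the next column?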
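I stopped R from inserting a dot into the column names by setting check.names = F, and then replaced the spaces with _. Now the tricky part: I want to split each column into 4 columns. For example, X_1%_2/2/2012:P is one of my column names, and the column holds data (NA, blanks and other values). How do I create 4 columns so that X, 1%, 2/2/2012 and P are 4 separate columns, with the original data under the P column and the other 3 columns left blank?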
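One way to read this request is as a reshape to long format, where the four parts of each column name become four columns and the original values sit in a fifth. The sketch below is only an illustration under that assumption, using base R and two columns shaped like AA_5.55_02/01/2017:Price. The output column names ticker, coupon, maturity and field are made up for the example, and it assumes every name has exactly four parts separated by _ and :.

```r
# Small example with column names in the same shape as the real data
df <- data.frame(
  Date = as.Date(c("2013-07-08", "2013-07-09")),
  `AA_5.55_02/01/2017:Price` = c(105.57574, 105.63598),
  `AAL_6.125_06/01/2018:Price` = c(NA_real_, NA_real_),
  check.names = FALSE
)

value_cols <- setdiff(names(df), "Date")

# Stack the value columns: one row per (Date, original column) pair
long <- data.frame(
  Date   = rep(df$Date, times = length(value_cols)),
  column = rep(value_cols, each = nrow(df)),
  value  = unlist(df[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split each original column name on "_" or ":" into its four parts
parts <- do.call(rbind, strsplit(long$column, "[_:]"))
colnames(parts) <- c("ticker", "coupon", "maturity", "field")  # hypothetical names

long <- cbind(long["Date"],
              as.data.frame(parts, stringsAsFactors = FALSE),
              long["value"])
long$coupon   <- as.numeric(long$coupon)
long$maturity <- as.Date(long$maturity, format = "%m/%d/%Y")
```

The long layout avoids creating three mostly empty columns per original column, but it is a different shape from the one asked for, so it may or may not suit the later analysis.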
Below is my code for reference (it is not the most efficient code, but it is my first R code, and I still have a long way to go...)
# I am using the XLConnect package so far
library(XLConnect)
# list.files() takes a regular expression, not a shell glob
data.files = list.files(pattern = "\\.xlsx?$")
df = readWorksheetFromFile(file=data.files[1], sheet=1, check.names=FALSE) # Read the first file
# Loop through the remaining files and merge them to the existing data frame
for (file in data.files[-1]) {
  newFile = readWorksheetFromFile(file=file, sheet=1, check.names=FALSE)
  # merge() has no check.names argument, so it is dropped here
  df = merge(df, newFile, all = TRUE)
}
write.csv(df, file="read_1.csv")
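# Deleting some unwanted columns (approximate match on the column names)
# A logical mask is used because df[, -integer(0)] would drop every column when nothing matches
df = df[, !agrepl("Mty", colnames(df)), drop = FALSE]
df = df[, !agrepl("Dur", colnames(df)), drop = FALSE]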
# Replacing spaces in the column names with underscores
# gsub() replaces every space; sub() only replaces the first one per call
names(df) <- gsub(" ", "_", names(df))
write.csv(df, file="read_1.csv") # writing output to CSV (overwrites the earlier file)
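The read-and-merge loop above could also be written by collecting the data frames in a list and folding the list with Reduce. The sketch below is only an illustration: it uses toy data frames in place of the Excel sheets so that it runs without XLConnect, and both the toy values and the choice of Date as the only join key are assumptions.

```r
# Toy stand-ins for the data frames read from each workbook (hypothetical data)
sheets <- list(
  data.frame(Date = as.Date(c("2013-07-08", "2013-07-09")), A = c(1, 2)),
  data.frame(Date = as.Date(c("2013-07-09", "2013-07-10")), B = c(3, 4)),
  data.frame(Date = as.Date("2013-07-08"), C = 5)
)

# Full outer join on the shared key, applied pairwise across the list
merged <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), sheets)
```

With the real files, the list could presumably be built with something like lapply(data.files, readWorksheetFromFile, sheet = 1, check.names = FALSE). Note that the original loop does not pass by, so merge joins on every column name the two frames share, whereas joining on Date alone gives same-named columns .x and .y suffixes instead.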
Below is the dput(head(df)) output (I removed a lot of data to keep the output concise):
structure(list(Date = structure(c(1373259600, 1373346000, 1373432400,
1373518800, 1373605200, 1373864400), class = c("POSIXct", "POSIXt"
), tzone = ""), `AA_5.55_02/01/2017:Price` = c(105.57574, 105.63598,
105.70395, 106.62471, 106.49467, 106.62642), `AA_6.5_06/15/2018:Price` = c(106.75947,
106.84083, 107.248726, 108.383835, 108.39564, 108.3026), `AA_5.72_02/23/2019:Price` = c(101.00432,
101.09463, 101.67893, 102.75101, 103.0618, 103.267204), `AAL_6.125_06/01/2018:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_5.25_01/31/2021:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_4.95_01/15/2023:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_04/03/2017:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_09/27/2017:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_9.375_04/08/2019:Price` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Date",
"AA_5.55_02/01/2017:Price", "AA_6.5_06/15/2018:Price", "AA_5.72_02/23/2019:Price",
"AAL_6.125_06/01/2018:Price", "AAL_5.25_01/31/2021:Price", "AAL_4.95_01/15/2023:Price",
"AALLN_2.625_04/03/2017:Price", "AALLN_2.625_09/27/2017:Price",
"AALLN_9.375_04/08/2019:Price"), row.names = c(NA, 6L), class = "data.frame")