Manipulating data with R

Time: 2014-03-16 02:31:24

Tags: r split duplicates

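I am trying to combine hundreds of (Excel) files, each with many rows and columns, into one large dataset. I was able to write code that merges the files into one. However, I am having trouble cleaning the data. In particular, I have the following hurdles: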

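  1. Many (not all) of the files have duplicate columns. Currently R keeps all of the columns and appends X.1, X.2, etc. to the duplicates. How can I stop R from reading a duplicate column and move on to the next one?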

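  2. I stopped R from inserting a dot into the column names by setting check.names = F, and then replaced the spaces with _. Now the tricky part: I want to split each column into 4 columns. For example, X_1%_2/2/2012:P is one of my column names, and the column contains data (NA, blanks, and other values). How can I create 4 columns, so that X, 1%, 2/2/2012 and P are 4 separate columns, with the original data under the P column and the other 3 columns left blank?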

  3. Below is my code for reference (it is not the most efficient code, but it is my first R code, and I still have a long way to go...)

    # I am using the XLConnect package so far
    library(XLConnect)

    # list.files() expects a regular expression, not a shell glob
    data.files = list.files(pattern = "\\.xlsx?$")
    df = readWorksheetFromFile(file = data.files[1], sheet = 1, check.names = FALSE) # Read the first file

    # Loop through the remaining files and merge them into the existing data frame
    for (file in data.files[-1]) {
      newFile = readWorksheetFromFile(file = file, sheet = 1, check.names = FALSE)
      df = merge(df, newFile, all = TRUE)
    }

    write.csv(df, file = "read_1.csv")

    # Deleting some unwanted columns (approximate match on the column name).
    # A logical index keeps every column when nothing matches.
    df = df[, !agrepl("Mty", colnames(df)), drop = FALSE]
    df = df[, !agrepl("Dur", colnames(df)), drop = FALSE]

    # Replacing every space in the column names with an underscore
    names(df) <- gsub(" ", "_", names(df))

    write.csv(df, file = "read_1.csv") # writing output in CSV
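
For points 1 and 3, one possible direction is sketched here; it has not been run against the real files. The idea is to drop repeated column names from each sheet before merging, then fold the list of data frames together with Reduce(). The helper name drop_duplicate_columns and the toy data frames are made up for illustration. The sketch assumes that duplicated columns carry exactly the same name, that the first copy is the one to keep, and that Date is the merge key.

```r
# Hypothetical helper (not part of XLConnect): keep only the first
# occurrence of each column name in a data frame.
drop_duplicate_columns <- function(d) {
  d[!duplicated(names(d))]
}

# Toy stand-ins for sheets read with check.names = FALSE.
# The first one has a repeated column name "A".
sheets <- list(
  data.frame(Date = 1:3, A = c(1, 2, 3), A = c(1, 2, 3), check.names = FALSE),
  data.frame(Date = 2:4, B = c(5, 6, 7), check.names = FALSE)
)

# Clean every sheet, then merge them pairwise on Date (an assumed key),
# keeping rows that appear in only one sheet.
cleaned <- lapply(sheets, drop_duplicate_columns)
merged <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), cleaned)

print(merged)
```

With real data, the sheets list would presumably come from something like lapply(data.files, readWorksheetFromFile, sheet = 1, check.names = FALSE). If the duplicates already carry a suffix by the time they are read, the suffix would need to be stripped before duplicated() can detect them.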
    

Below is the dput(head(df)) output (I removed a lot of the data to keep it concise):

    structure(list(Date = structure(c(1373259600, 1373346000, 1373432400, 
    1373518800, 1373605200, 1373864400), class = c("POSIXct", "POSIXt"
    ), tzone = ""), `AA_5.55_02/01/2017:Price` = c(105.57574, 105.63598, 
    105.70395, 106.62471, 106.49467, 106.62642), `AA_6.5_06/15/2018:Price` = c(106.75947, 
    106.84083, 107.248726, 108.383835, 108.39564, 108.3026), `AA_5.72_02/23/2019:Price` = c(101.00432, 
    101.09463, 101.67893, 102.75101, 103.0618, 103.267204), `AAL_6.125_06/01/2018:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_5.25_01/31/2021:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AAL_4.95_01/15/2023:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_04/03/2017:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_2.625_09/27/2017:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `AALLN_9.375_04/08/2019:Price` = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Date", 
    "AA_5.55_02/01/2017:Price", "AA_6.5_06/15/2018:Price", "AA_5.72_02/23/2019:Price", 
    "AAL_6.125_06/01/2018:Price", "AAL_5.25_01/31/2021:Price", "AAL_4.95_01/15/2023:Price", 
    "AALLN_2.625_04/03/2017:Price", "AALLN_2.625_09/27/2017:Price", 
    "AALLN_9.375_04/08/2019:Price"), row.names = c(NA, 6L), class = "data.frame")
    
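
For point 2, a minimal sketch of one way to break column names like the ones in the dput() output into their four parts, using base R only. It builds a lookup table of name parts rather than reshaping the data itself, which is only one reading of what is being asked. The labels issuer, coupon, maturity and field are guesses for illustration, and the sketch assumes that none of the four parts contains "_" or ":".

```r
# Two column names copied from the dput() output.
cols <- c("AA_5.55_02/01/2017:Price", "AALLN_9.375_04/08/2019:Price")

# Split each name on "_" or ":". This assumes every name yields
# exactly four pieces.
parts <- strsplit(cols, "[_:]")

# Stack the pieces into a character matrix, then a data frame.
name_parts <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Illustrative labels only; they are not taken from the source data.
names(name_parts) <- c("issuer", "coupon", "maturity", "field")

# Keep the original column name so the parts can be joined back to the data.
name_parts$column <- cols

print(name_parts)
```

From here the coupon could be converted with as.numeric() and the maturity with as.Date(name_parts$maturity, format = "%m/%d/%Y"), if typed values are wanted.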

0 answers:

No answers