Building a proper data frame from a list of matrices after importing an .xlsx file

Time: 2014-10-30 20:15:28

Tags: r excel dataframe xlsx

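Implemented

I am importing an .xlsx file into R. The file consists of three sheets. I bind all the sheets into a list.

To be implemented

Now I want to combine this list of matrices into a single data.frame, with the header taken from names(dataset).

I tried as.data.frame together with read.xlsx, as described in the help, but it did not work. I also explicitly tried as.data.frame(as.table(dataset)), but that still produces one long list of data.frames, which is not what I want.

I would like a structure with header = names and the values below it, the same way read.table imports data.

Here is the code I am using: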

    xlfile <- list.files(pattern = "*.xlsx")
    wb <- loadWorkbook(xlfile)
    sheet_ct <- wb$getNumberOfSheets()
    b <- rbind(list(lapply(1:sheet_ct, function(x) {
             res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
    b <- b [-c(1),] # Just want to remove the second header
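
A likely reason the code above does not yield a data frame is that `rbind(list(...))` receives the whole list as a single argument. The following is a minimal sketch of the difference, assuming three small made-up data frames in place of the real sheets (`sheets` and its values are hypothetical, not taken from the file):

```r
# Hypothetical stand-ins for the three sheets returned by read.xlsx()
sheets <- list(
  data.frame(Ei = 1L, Mi = 1L, hours = 1:2, pH = c(7.879, 7.879)),
  data.frame(Ei = 2L, Mi = 1L, hours = 1:2, pH = c(7.881, 7.880)),
  data.frame(Ei = 3L, Mi = 1L, hours = 1:2, pH = c(7.878, 7.878))
)

# rbind(list(...)) sees one argument (the list itself) and returns a
# 1 x 1 list-matrix, so dropping row 1 afterwards leaves nothing useful
wrapped <- rbind(list(sheets))
dim(wrapped)

# do.call() passes each data frame to rbind() as a separate argument,
# which stacks them row-wise into one data frame
combined <- do.call(rbind, sheets)
str(combined)
```

This only works directly when all sheets share the same column names; otherwise the names would need to be aligned first.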

I would like the data arranged as shown below.

Ei  Mi  hours   Nphy    Cphy    CHLphy  Nhet    Chet    Ndet    Cdet    DON DOC DIN DIC AT  dCCHO   TEPC    Ncocco  Ccocco  CHLcocco    PICcocco    par Temp    Sal co2atm  u10 dicfl   co2ppm  co2mol  pH
1   1   1   1   0.1023488   0.6534707   0.1053458   0.04994161  0.3308593   0.04991916  0.3307085   0.05042275  49.76304    14.99330000 2050.132    2150.007    0.9642220   0.1339044   0.1040715   0.6500288   0.1087667   0.1000664   0.0000000   9.900000    31.31000    370 0.01    -2.963256000    565.1855    0.02562326  7.879427
2   1   1   2   0.1045240   0.6448216   0.1103250   0.04988347  0.3304699   0.04984045  0.3301691   0.05085697  49.52745    14.98729000 2050.264    2150.007    0.9308690   0.1652179   0.1076058   0.6386706   0.1164099   0.1001396   0.0000000   9.900000    31.31000    370 0.01    -2.971632000    565.7373    0.02564828  7.879042
3   1   1   3   0.1064772   0.6369597   0.1148174   0.04982555  0.3300819   0.04976363  0.3296314   0.05130091  49.29323    14.98221000 2050.396    2150.007    0.8997098   0.1941872   0.1104229   0.6291149   0.1225822   0.1007908   0.8695131   9.900000    31.31000    370 0.01    -2.980446000    566.3179    0.02567460  7.878636
4   1   1   4   0.1081702   0.6299084   0.1187672   0.04976784  0.3296952   0.04968840  0.3290949   0.05175249  49.06034    14.97810000 2050.524    2150.007    0.8705440   0.2210289   0.1125141   0.6213265   0.1273103   0.1018360   1.5513170   9.900000    31.31000    370 0.01    -2.989259000    566.8983    0.02570091  7.878231
5   1   1   5   0.1095905   0.6239005   0.1221460   0.04971029  0.3293089   0.04961446  0.3285598   0.05220978  48.82878    14.97485000 2050.641    2150.007    0.8431960   0.2459341   0.1140222   0.6152447   0.1308843   0.1034179   2.7777070   9.900000

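Please do not suggest putting all the data on one sheet and converting the .xlsx to .csv or a plain text format. I am trying to get a proper data frame from the .xlsx file.

Here is the file

This is a follow-up to this post: Followup

This is the result: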

str(full_data)
'data.frame':   0 obs. of  19 variables:
 $ Experiment : Factor w/ 2 levels "#","1": 
 $ Mesocosm   : Factor w/ 10 levels "#","1","2","3",..: 
 $ Exp.day    : Factor w/ 24 levels "1","10","11",..: 
 $ Hour       : Factor w/ 24 levels "108","12","132",..: 
 $ Temperature: Factor w/ 125 levels "10","10.01","10.02",..: 
 $ Salinity   : num 
 $ pH         : num 
 $ DIC        : Factor w/ 205 levels "1582.2925","1588.6475",..: 
 $ TA         : Factor w/ 117 levels "1813","1826",..: 
 $ DIN        : Factor w/ 66 levels "0.2","0.3","0.4",..: 
 $ Chl.a      : Factor w/ 156 levels "0.171","0.22",..: 
 $ PIC        : Factor w/ 194 levels "-0.47","-0.96",..: 
 $ POC        : Factor w/ 199 levels "-0.046","1.733",..: 
 $ PON        : Factor w/ 151 levels "1.675","1.723",..: 
 $ POP        : Factor w/ 110 levels "0.032","0.034",..: 
 $ DOC        : Factor w/ 93 levels "100.1","100.4",..: 
 $ DON        : Factor w/ 1 level "µmol/L": 
 $ DOP        : Factor w/ 1 level "µmol/L": 
 $ TEP        : Factor w/ 100 levels "10.4934","11.0053",..: 

[Note: the above is the structure after reading from the .xlsx file; the factor levels make calculation and manipulation tedious and messy.]
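
Levels such as "#" and "µmol/L" suggest that a row of units sits directly under the header, which would force every column to be read as text. The sketch below is an illustration under that assumption, using a made-up data frame `raw` rather than the real sheet: it drops the units row and lets `type.convert()` pick a type for each column.

```r
# Hypothetical sheet as read in: the first data row holds units, not values
raw <- data.frame(
  Exp.num  = c("#", "1", "1"),
  Mesocosm = c("#", "1", "2"),
  DON      = c("umol/L", "5.2", "5.4"),
  stringsAsFactors = TRUE
)

values <- raw[-1, ]  # drop the units row

# Convert each column from its text labels; as.character() avoids
# picking up the internal factor codes
values[] <- lapply(values, function(col) {
  type.convert(as.character(col), as.is = TRUE)
})
rownames(values) <- NULL
str(values)
```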

This is what I want to achieve:

str(a)
'data.frame':   9936 obs. of  29 variables:
 $ Ei      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Mi      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hours   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Cphy    : num  0.653 0.645 0.637 0.63 0.624 ...
 $ CHLphy  : num  0.105 0.11 0.115 0.119 0.122 ...
 $ Nhet    : num  0.0499 0.0499 0.0498 0.0498 0.0497 ...
 $ Chet    : num  0.331 0.33 0.33 0.33 0.329 ...
 $ Ndet    : num  0.0499 0.0498 0.0498 0.0497 0.0496 ...
 $ Cdet    : num  0.331 0.33 0.33 0.329 0.329 ...
 $ DON     : num  0.0504 0.0509 0.0513 0.0518 0.0522 ...
 $ DOC     : num  49.8 49.5 49.3 49.1 48.8 ...
 $ DIN     : num  15 15 15 15 15 ...
 $ DIC     : num  2050 2050 2050 2051 2051 ...
 $ AT      : num  2150 2150 2150 2150 2150 ...
 $ dCCHO   : num  0.964 0.931 0.9 0.871 0.843 ...
 $ TEPC    : num  0.134 0.165 0.194 0.221 0.246 ...
 $ Ncocco  : num  0.104 0.108 0.11 0.113 0.114 ...
 $ Ccocco  : num  0.65 0.639 0.629 0.621 0.615 ...
 $ CHLcocco: num  0.109 0.116 0.123 0.127 0.131 ...
 $ PICcocco: num  0.1 0.1 0.101 0.102 0.103 ...
 $ par     : num  0 0 0.87 1.55 2.78 ...
 $ Temp    : num  9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
 $ Sal     : num  31.3 31.3 31.3 31.3 31.3 ...
 $ co2atm  : num  370 370 370 370 370 370 370 370 370 370 ...
 $ u10     : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
 $ dicfl   : num  -2.96 -2.97 -2.98 -2.99 -3 ...
 $ co2ppm  : num  565 566 566 567 567 ...
 $ co2mol  : num  0.0256 0.0256 0.0257 0.0257 0.0257 ...
 $ pH      : num  7.88 7.88 7.88 7.88 7.88 ...

[Note: sorry about the extra columns; this is a different dataset (plain text) that I read with read.table]

With NA handling:

> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4

Without NA handling:

> unique(full_data$Exp.num)
[1]  1 NA  2  3
> unique(full_data$Mesocosm)
 [1]  1  2  3  4  5  6  7  8  9 NA
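
The values 2 3 4 in the "with NA handling" output are consistent with factor level codes being returned in place of the labels 1 2 3. This is an interpretation of the output, not something the post confirms. A minimal sketch with a made-up factor `exp_num` shows the effect:

```r
# Hypothetical factor whose levels include the "#" left over from a units row
exp_num <- factor(c("1", "2", "3"), levels = c("#", "1", "2", "3"))

# as.numeric() on a factor returns the position of each level, not its label
codes <- as.numeric(exp_num)

# Going through as.character() first recovers the labels as numbers
values <- as.numeric(as.character(exp_num))

print(codes)
print(values)
```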

1 answer:

Answer 0: (score: 1)

I think this is what you need. I have added some comments on what I am doing:

    library(xlsx)  # provides loadWorkbook() and read.xlsx()

    xlfile <- list.files(pattern = "\\.xlsx$")  # pattern is a regular expression, not a glob
    wb <- loadWorkbook(xlfile)
    sheet_ct <- wb$getNumberOfSheets()
    for (i in 1:sheet_ct) {  # read the sheets into 3 separate data frames (mydf_1, mydf_2, mydf_3)
      print(i)
      variable_name <- sprintf('mydf_%s', i)
      # With startRow/endRow you don't need my function below to eliminate NAs,
      # but you need to specify the first and last rows.
      assign(variable_name, read.xlsx(xlfile, sheetIndex = i, startRow = 1, endRow = 209))
    }

    colnames(mydf_1) <- names(mydf_2)  # this was unclear. I chose the second sheet's
    # names as column names, but you can choose whichever you want in the same way
    # (the second and third sheets had the same names).

    # Some of the sheets were loaded with a few blank rows (full of NAs), which I remove
    # with the following function, based on the first column, which is always populated
    # according to what I see.
    remove_na_rows <- function(x) {
      sum(!is.na(x))  # number of non-NA entries
    }

    mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num), ]
    mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num), ]
    mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num), ]

    full_data <- rbind(mydf_1[-1, ], mydf_2[-1, ], mydf_3[-1, ])  # making one data frame here
    # Convert fields to numeric. Going through as.character() makes factors yield their
    # labels rather than their level codes; assigning into full_data[] keeps a data frame.
    full_data[] <- lapply(full_data, function(x) as.numeric(as.character(x)))
    # Use this to convert any column to integer. Ei, Mi and hours are the names from the
    # desired output; substitute the matching column names of your sheets.
    full_data$Ei <- as.integer(full_data[['Ei']])
    full_data$Mi <- as.integer(full_data[['Mi']])
    full_data$hours <- as.integer(full_data[['hours']])

    #********* code to use for removing NA rows *****************
    # So if you rbind without caring about the NA rows, you can use the code below to get rid of them.
    # I just tested it and it seems to be working.

    n_row <- NULL
    for (i in seq_len(nrow(full_data))) {
      x <- full_data[i, ]
      if (all(is.na(x))) {
        n_row <- append(n_row, i)
      }
    }

    # Guard: with no all-NA rows, full_data[-n_row, ] would select zero rows
    if (length(n_row) > 0) {
      full_data <- full_data[-n_row, ]
    }
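
The loop that collects all-NA rows can also be written as a vectorized filter. The sketch below uses a small made-up data frame `demo` (not the real data) and shows one pitfall of negative indexing: when the index vector is empty, `x[-idx, ]` selects zero rows instead of all of them.

```r
# Hypothetical data with one row that is entirely NA
demo <- data.frame(
  Exp.num  = c(1, NA, 2, NA),
  Mesocosm = c(1, NA, 2, 3)
)

# Keep every row that has at least one non-NA value
keep <- rowSums(!is.na(demo)) > 0
cleaned <- demo[keep, ]
print(cleaned)

# Pitfall: an empty negative index drops everything
n_row <- NULL
emptied <- demo[-n_row, ]
nrow(emptied)
```

A logical filter such as `keep` behaves the same whether or not any all-NA rows exist, so it needs no special case.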

I think this is now what you need.