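Skipping rows when reading multiple Excel sheets in R

Time: 2018-03-06 09:20:31

Tags: r

I am using the readxl library to read many Excel sheets from the same workbook (called data.xlsx), which are laid out as follows:

The data starts on row 3.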

  row1
  row2
 companyName   1980    1981    1982 ... 2016
 company1       5       6       7        8
 company2       10      20      30       40
 company3       20      40      60       80
 ....

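The data range differs in length from sheet to sheet, both in rows and in columns. However, all sheets share companyName as the common key. The years run from either 1980 or 1990 through 2016. The sheet name is the data name.

I want to create a single Excel file that contains the data extracted from all sheets.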

 companyName   Year   dataname     values
 company1      1980   sheetname1     5
 company1      1981   sheetname1     6
 company1      1982   sheetname1     7
 company1      ...    sheetname1     ...
 company1      2016   sheetname1     8
 company2      1980   sheetname1     10
 company2      1981   sheetname1     20
 company2      1982   sheetname1     30
 company2      ...    sheetname1     ...
 company2      2016   sheetname1     40
 ....          ....     ...           ...
 company1      2000    sheetname2     xxx
 company1      2001    sheetname2     yyy
  etc
  etc
  etc

This is what I have managed so far:

  library(tidyverse)
  library(readxl)
  library(data.table)

   # read excel file (adapted from the source cited below)
   file.list<-"data.xlsx"

   # read all sheets (and skip the first two rows)

   df.list <- lapply(file.list,function(x) {
     sheets <- excel_sheets(x)
     dfs <- lapply(sheets, function(y) {
       read_excel(x, sheet = y,skip=2)
       })
     names(dfs) <- sheets
     dfs
   })

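I have the following problems:

  • The first two rows are not skipped.
  • How can I create a data frame that contains only selected sheets (i.e. sheet 5, sheet 10 and sheet 15)?

Thanks for your help.

Source: R: reading multiple excel files, extract first sheet names, and create new column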

2 answers:

Answer 0 (score: 3)

I just removed one level of nesting from df.list.

df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]

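This works for me; I could not reproduce your problem with skip. Also, if those rows are simply empty, read_excel() already skips leading empty rows on its own (skip is only a minimum), whereas trim_ws = TRUE merely trims whitespace inside cells.

I use the following list to demonstrate what to do after the import.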

df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6, 
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7, 
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1", 
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9, 
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990", 
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2", 
"sheetname3"))

The following works whether the years start in 1980 or in 1990.

dat <- lapply(df.list, function(x){
  # stack every year column into (year, values) pairs, keeping companyName as the key
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(value)

dat

# # A tibble: 27 x 4
#    name       companyName year  values
#    <chr>      <chr>       <chr>  <dbl>
#  1 sheetname1 company1    1980      5.
#  2 sheetname1 company2    1980     10.
#  3 sheetname1 company3    1980     40.
#  4 sheetname1 company1    1981      6.
#  5 sheetname1 company2    1981     20.
#  6 sheetname1 company3    1981     50.
#  7 sheetname1 company1    1982      7.
#  8 sheetname1 company2    1982     30.
#  9 sheetname1 company3    1982     60.
# 10 sheetname2 company1    1980      6.
# # ... with 17 more rows
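
As a variation on the reshaping step above, here is a minimal sketch that uses tidyr::pivot_longer() and dplyr::bind_rows() instead of gather() and enframe(). The list sheets and the column names Year, values and dataname are assumptions chosen to match the layout requested in the question; names_transform needs a reasonably recent tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for two sheets whose year ranges differ.
sheets <- list(
  sheetname1 = data.frame(companyName = c("company1", "company2"),
                          `1980` = c(5, 10), `1981` = c(6, 20),
                          check.names = FALSE),
  sheetname3 = data.frame(companyName = c("company1", "company2"),
                          `1990` = c(8, 12), `1991` = c(9, 22),
                          check.names = FALSE)
)

# Reshape each sheet to long format, then stack them;
# .id stores the list names (the sheet names) in a new column.
long <- lapply(sheets, function(x) {
  pivot_longer(x, cols = -companyName,
               names_to = "Year", values_to = "values",
               names_transform = list(Year = as.integer))
}) %>%
  bind_rows(.id = "dataname")
```

Because each sheet is reshaped before stacking, sheets with different year ranges do not produce columns full of NA.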

Now you can select a specific sheet name with dplyr::filter().

For example:

dat %>% filter(name == "sheetname1")

#   name       companyName year  values
#   <chr>      <chr>       <chr>  <dbl>
# 1 sheetname1 company1    1980      5.
# 2 sheetname1 company2    1980     10.
# 3 sheetname1 company3    1980     40.
# 4 sheetname1 company1    1981      6.
# 5 sheetname1 company2    1981     20.
# 6 sheetname1 company3    1981     50.
# 7 sheetname1 company1    1982      7.
# 8 sheetname1 company2    1982     30.
# 9 sheetname1 company3    1982     60.
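
If only a few sheets are needed, they can also be selected before reading rather than filtered afterwards. The sketch below is an illustration under assumptions: it builds a small throwaway workbook with openxlsx (sheet names sheet1 to sheet4 are hypothetical) and then reads sheets by position; for the question's case the index vector would be c(5, 10, 15).

```r
library(openxlsx)  # only used here to build a small demo workbook
library(readxl)

# Build a hypothetical workbook: two filler rows, then the header on row 3.
path <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
for (s in paste0("sheet", 1:4)) {
  addWorksheet(wb, s)
  writeData(wb, s, data.frame(x = c("row1", "row2")), colNames = FALSE)
  writeData(wb, s,
            data.frame(companyName = c("company1", "company2"),
                       `1980` = c(5, 10), `1981` = c(6, 20),
                       check.names = FALSE),
            startRow = 3)
}
saveWorkbook(wb, path, overwrite = TRUE)

# Pick sheets by position, then read each one while skipping the two filler rows.
wanted <- excel_sheets(path)[c(2, 4)]
selected <- lapply(wanted, function(s) read_excel(path, sheet = s, skip = 2))
names(selected) <- wanted
```

Indexing the result of excel_sheets() keeps the sheet names attached to the data, so the later reshaping step can still record which sheet each row came from.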

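Answer 1 (score: 2)

I suggest using the openxlsx package, which lets you specify startRow, together with melt from the reshape2 package, which converts a data frame to long format in a simple way.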

library(openxlsx)
library(reshape2)

first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- loadWorkbook(file.list)$sheet_names # retrieving the sheet names
df <- data.frame()
for(tmp.sheet in sheets.2.read){
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}

This is my output with some dummy data (only 10 rows printed):

> df[c(1:3, 50:53, 300:302),]
    company.name variable     value tmp.sheet
1          comp7     1968 0.3359298    Sheet1
2          comp8     1968 0.3359298    Sheet1
3          comp9     1968 0.3359298    Sheet1
50        comp16     1966 0.3359298    Sheet2
51        comp17     1966 0.3359298    Sheet2
52        comp18     1966 0.3359298    Sheet2
53        comp19     1966 0.3359298    Sheet2
300       comp16     2000 0.3359298    Sheet3
301       comp17     2000 0.3359298    Sheet3
302       comp18     2000 0.3359298    Sheet3
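
Since the question asks for a single Excel file as the end result, here is a small sketch of the melt-and-combine step followed by a write with openxlsx::write.xlsx(). The list sheets, the column names Year, values and dataname, and the temporary output path are assumptions for illustration; the pieces are collected in a list and bound once instead of growing a data frame inside a loop.

```r
library(reshape2)
library(openxlsx)

# Hypothetical stand-ins for sheets that have already been read in.
sheets <- list(
  sheetname1 = data.frame(companyName = c("company1", "company2"),
                          `1980` = c(5, 10), `1981` = c(6, 20),
                          check.names = FALSE),
  sheetname2 = data.frame(companyName = c("company1", "company2"),
                          `2000` = c(1, 2), `2001` = c(3, 4),
                          check.names = FALSE)
)

# Melt each sheet to long format and tag it with its sheet name.
pieces <- lapply(names(sheets), function(s) {
  long <- melt(sheets[[s]], id.vars = "companyName",
               variable.name = "Year", value.name = "values")
  long$Year <- as.integer(as.character(long$Year))  # factor -> integer year
  long$dataname <- s
  long
})
combined <- do.call(rbind, pieces)

# Write the combined table to a single-sheet workbook.
out <- tempfile(fileext = ".xlsx")
write.xlsx(combined, out)
```

Replacing the temporary path with a real file name would give the combined workbook described in the question.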