I am using the readxl package to read many worksheets from a single Excel workbook (called data.xlsx). Each sheet has the following layout.
The data starts at row 3.
row1
row2
companyName 1980 1981 1982 ... 2016
company1 5 6 7 8
company2 10 20 30 40
company3 20 40 60 80
....
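The row and column ranges differ in length from sheet to sheet, but every sheet has companyName as the common key. The year range runs from either 1980 or 1990 up to 2016. The sheet name is the name of the data series.
I want to create a single Excel file that pulls together all the data from all the sheets, like this: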
companyName Year dataname values
company1 1980 sheetname1 5
company1 1981 sheetname1 6
company1 1982 sheetname1 7
company1 ... sheetname1 ...
company1 2016 sheetname1 8
company2 1980 sheetname1 10
company2 1981 sheetname1 20
company2 1982 sheetname1 30
company2 ... sheetname1 ...
company2 2016 sheetname1 40
.... .... ... ...
company1 2000 sheetname2 xxx
company1 2001 sheetname2 yyy
etc
etc
etc
This is what I have managed so far:
library(tidyverse)
library(readxl)
library(data.table)
# read the Excel file (approach taken from the question cited under Source)
file.list <- "data.xlsx"
# read all sheets (and skip the first two rows)
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})
I have the following problems:
Thanks for your help.
Source: R: reading multiple excel files, extract first sheet names, and create new column
Answer 0 (score: 3)
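I just removed one level of nesting from df.list. Because file.list holds a single file, the outer lapply() returns a list of length one, and [[1]] extracts the named list of sheets from it.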
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]
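This works for me. I could not reproduce your problem with skip. Also, if those rows are simply empty, read_excel() already skips leading empty rows by default, so skip acts as a lower bound. (trim_ws = TRUE, the default, only trims whitespace inside cells.)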
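To check the skip behaviour without the original workbook, here is a minimal sketch that builds a small test file with the layout from the question and reads it back. It assumes the openxlsx package is available for writing; the sheet name, the values, and the temporary path are made up for illustration.

```r
library(openxlsx)
library(readxl)

# Hypothetical test workbook: two filler rows, then the header on row 3
path <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
addWorksheet(wb, "sheetname1")
writeData(wb, "sheetname1", data.frame(x = c("row1", "row2")),
          startRow = 1, colNames = FALSE)
writeData(wb, "sheetname1",
          data.frame(companyName = c("company1", "company2"),
                     `1980` = c(5, 10), `1981` = c(6, 20),
                     check.names = FALSE),
          startRow = 3)
saveWorkbook(wb, path, overwrite = TRUE)

# skip = 2 drops the two filler rows, so row 3 becomes the header
sheet <- read_excel(path, sheet = "sheetname1", skip = 2)
```

If the header does not come back as companyName, 1980, 1981, the filler rows in the real file probably occupy a different number of rows than assumed here.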
I use the following list to demonstrate what to do after the import.
df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6,
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7,
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1",
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9,
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990",
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2",
"sheetname3"))
The following works regardless of whether the years start in 1980 or 1990.
dat <- lapply(df.list, function(x) {
  # reshape each sheet from wide (one column per year) to long
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(value)  # "name" holds the sheet name
dat
# # A tibble: 27 x 4
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
# 10 sheetname2 company1 1980 6.
# # ... with 17 more rows
Now you can filter by sheet name using dplyr::filter(). For example:
dat %>% filter(name == "sheetname1")
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
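For the exact column layout asked for in the question (companyName, Year, dataname, values), a variant along these lines might be closer. It is only a sketch: it assumes tidyr 1.0.0 or later for pivot_longer(), and the small sheets list is a made-up stand-in for the imported df.list.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the imported sheets: a named list of wide tables
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname2 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- lapply(sheets, function(x) {
  # every column except companyName is treated as a year column
  pivot_longer(x, cols = -companyName,
               names_to = "Year", values_to = "values")
}) %>%
  bind_rows(.id = "dataname") %>%      # list names become the dataname column
  mutate(Year = as.integer(Year)) %>%  # years arrive as character headers
  select(companyName, Year, dataname, values)
```

Converting Year to integer makes later sorting and filtering by year numeric rather than alphabetical.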
Answer 1 (score: 2)
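I suggest using openxlsx, a package that lets you specify startRow, together with melt from the reshape2 package, which converts a data frame to long format in a simple way.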
library(openxlsx)
library(reshape2)
# file.list is the workbook path defined in the question ("data.xlsx")
first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- getSheetNames(file.list) # retrieving the sheet names
df <- data.frame()
for (tmp.sheet in sheets.2.read) {
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  # wide to long, then tag every row with the sheet it came from
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}
Here is my output on some dummy data (printing only 10 rows):
> df[c(1:3, 50:53, 300:302),]
company.name variable value tmp.sheet
1 comp7 1968 0.3359298 Sheet1
2 comp8 1968 0.3359298 Sheet1
3 comp9 1968 0.3359298 Sheet1
50 comp16 1966 0.3359298 Sheet2
51 comp17 1966 0.3359298 Sheet2
52 comp18 1966 0.3359298 Sheet2
53 comp19 1966 0.3359298 Sheet2
300 comp16 2000 0.3359298 Sheet3
301 comp17 2000 0.3359298 Sheet3
302 comp18 2000 0.3359298 Sheet3
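Since the goal in the question is a single Excel file, the combined data frame can then be written back out. The following is a small sketch using write.xlsx() from openxlsx; the combined data frame and the temporary output path are placeholders for illustration, not part of the code above.

```r
library(openxlsx)

# Hypothetical combined long-format result
combined <- data.frame(
  companyName = c("company1", "company2"),
  Year = c(1980L, 1980L),
  dataname = "sheetname1",
  values = c(5, 10),
  stringsAsFactors = FALSE
)

# Write everything to one sheet of a new workbook
out.file <- tempfile(fileext = ".xlsx")
write.xlsx(combined, out.file)

# Read it back to confirm the round trip
check <- read.xlsx(out.file)
```

In practice out.file would be a real path such as a file in the working directory rather than a temporary file.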