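I have an Excel file with several sheets, each with several columns, so I would rather not specify each column's type by hand and have it determined automatically instead. I want the sheets read the way read.csv does with stringsAsFactors = FALSE, because that interprets the column types correctly. With my current method, a column containing "0.492 ± 0.6" is interpreted as numeric and comes back as NA, "because" the stringsAsFactors option is not available in read_excel. So I wrote the workaround below, which more or less works, but I cannot use it in real life because I am not allowed to create a new file. Note: I need the other columns as numeric or integer, and the text-only columns as character, which is what stringsAsFactors = FALSE gives me in my read.csv example.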
library(readxl)
file <- "myfile.xlsx"
firstread <- read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
# firstread has the problem that the column with "0.492 ± 0.6"
# is interpreted as numeric (returns NA)
colna <- colnames(firstread)
# read every column as character
colnumt <- ncol(firstread)
textcol <- rep("text", colnumt)
secondreadchar <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                             col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999,
# and there are several other similar cases.
# read again with stringsAsFactors = FALSE
# critical step: in real life, I "cannot" write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor <- read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor) <- colna
# the column with "0.492 ± 0.6" is now character, as desired; the others are numeric, as desired
Answer 0 (score: 1)
Here is a script that imports all the data in your Excel file. It puts each sheet's data in a list called dfs:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {
  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))
  # Get the data frame with all columns as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep("text", col_num))
  # Convert from tibble to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  # Find the numeric fields by trying to convert them to numeric values.
  # If any non-missing value converts to NA, the field is not numeric.
  cond <- vapply(df, function(col) {
    col <- col[!is.na(col)]
    all(!is.na(suppressWarnings(as.numeric(col))))
  }, logical(1))
  numeric_cols <- names(df)[cond]
  # Convert those columns to numeric, one column at a time
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  # Return df in the desired format
  df
})
# Just for convenience, to remember which sheet
# is associated with which data frame
names(dfs) <- all_sheets
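The process is as follows:
First, you get all the sheets in the file with excel_sheets, then loop through the sheet names to create the data frames. For each data frame, you initially import the data as text by setting the col_types parameter to "text". Once the data frame's columns are text, you convert the structure from a tibble to a data.frame. After that, you find the columns that are actually numeric and convert them to numeric values.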
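The "find the numeric columns and convert them" step can also be tried on its own, without any file. The sketch below is only an illustration under assumptions: the data frame, its column names, and its values are made up to stand in for a sheet that was read with every column as text. It uses base R's type.convert(), which is the routine read.csv() relies on to pick column types, so it is one possible way to get the read.csv-like result without writing a temporary file.

```r
# Toy stand-in for a sheet that was read with every column as text.
# The column names and values are invented for illustration.
df <- data.frame(
  id      = c("1", "2", "3"),
  value   = c("0.532", "1.25", NA),
  mean_sd = c("0.492 ± 0.6", "0.51 ± 0.2", "0.38 ± 0.1"),
  stringsAsFactors = FALSE
)

# type.convert() picks the narrowest type that fits each character vector.
# as.is = TRUE keeps non-numeric columns as character instead of factor.
df[] <- lapply(df, type.convert, as.is = TRUE)

# Inspect the resulting column types
str(df)
```

Applied to each element of dfs after the all-text read, the same lapply() line would replace the manual numeric check, assuming the read.csv-style type rules are what you want.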
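As of late April, a new version of readxl has been released, and the read_excel function gained two enhancements relevant to this question. The first is that you can have the function guess the column types for you by passing "guess" to the col_types parameter. The second enhancement (a corollary of the first) is that a guess_max parameter was added to read_excel. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened to the following: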
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
dfs <- lapply(all_sheets, function(sheetname) {
  suppressWarnings(read_excel(path = "myfile.xlsx",
                              sheet = sheetname,
                              col_types = "guess",
                              guess_max = Inf))
})
# Just for convenience, to remember which sheet
# is associated with which data frame
names(dfs) <- all_sheets
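To see why the number of rows used for guessing matters, here is a toy illustration in base R. It is not readxl's implementation: guess_type is a hypothetical helper invented for this sketch, and readxl's real guessing works from the cell data in the file and differs in detail. The only point is that a guess based on the first few rows can miss a later non-numeric value.

```r
# Hypothetical helper, NOT part of readxl: guess a column's type
# from only its first `guess_max` values.
guess_type <- function(x, guess_max = 1000) {
  n <- min(guess_max, length(x))           # never look past the end of the column
  sample_x <- x[seq_len(n)]
  sample_x <- sample_x[!is.na(sample_x)]   # missing values carry no type information
  is_num <- !is.na(suppressWarnings(as.numeric(sample_x)))
  if (all(is_num)) "numeric" else "text"
}

# Made-up column where the non-numeric value appears late
column <- c("0.51", "0.47", "0.492 ± 0.6")

guess_type(column, guess_max = 2)    # inspects only the first two values
guess_type(column, guess_max = Inf)  # inspects every value
```

With the smaller limit the late "0.492 ± 0.6" entry is never inspected, which is the kind of situation where a column guessed as numeric ends up with NA.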
I recommend updating readxl to the latest version to shorten your script and avoid possible annoyances.
I hope this helps.