R readxl: reading data from an Excel file

Asked: 2017-11-15 19:24:21

Tags: r excel web-scraping readxl

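I am unable to read data from *.xls files into R. I am trying to use readxl::read_xls() to read data from the Microsoft Excel file at the following URL: https://www.misoenergy.org/Library/Repository/Market%20Reports/20171114_5min_exante_lmp.xls . I am on R version 3.4.1 (Single Candle), and the output of sessionInfo() is pasted at the very bottom of this post.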

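The file has 6 sheets containing data. As a minimal example, consider reading the second sheet, named RT Ex-Ante 5 Minute LMPs(1). The code below was my first attempt at reading this sheet: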

library(readxl)
fpath <- '/Users/bmosovsky/Downloads/20171114_5min_exante_lmp.xls'
data <- read_excel( path=fpath, sheet=2, col_names=FALSE )

This lets read_excel guess both the range of data to read and the column types. I received the warning message:

Warning message:
In read_fun(path = path, sheet = sheet, limits = limits, shim = shim,  :
  Expecting logical in B65535 / R65535C2: got 'IPL.CC.IPLEV01'

str(data) returns

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   65535 obs. of  6 variables:
 $ X__1: POSIXct, format: "2017-11-13 04:35:00" "2017-11-13 04:35:00" "2017-11-13 04:35:00" "2017-11-13 04:35:00" ...
 $ X__2: logi  NA NA NA NA NA NA ...
 $ X__3: logi  NA NA NA NA NA NA ...
 $ X__4: logi  NA NA NA NA NA NA ...
 $ X__5: logi  NA NA NA NA NA NA ...
 $ X__6: logi  NA NA NA NA NA NA ...
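For context on the warning above: per the readxl documentation, read_excel() guesses each column type from the first guess_max data rows (1000 by default), and a column whose inspected cells are all blank is typed as logical. The sketch below shows two arguments that may help inspect such a file: n_max to peek at only the first rows, and guess_max to widen the guessing window. It is only a diagnostic sketch, not a confirmed fix for this workbook. The path used here is a stand-in (an example workbook shipped with readxl) so that the code runs anywhere; substitute fpath to try it on the downloaded file.

```r
library(readxl)

# Stand-in workbook shipped with readxl (an assumption for illustration);
# replace with fpath to inspect the downloaded file.
path <- readxl_example("datasets.xls")

# Peek at the first 20 rows, reading every column as text so that
# no type guessing or coercion is involved.
peek <- read_excel(path, sheet = 2, col_names = FALSE,
                   col_types = "text", n_max = 20)
str(peek)

# Let the type guesser look at far more rows than the default 1000.
wide_guess <- read_excel(path, sheet = 2, col_names = FALSE,
                         guess_max = 65536)
str(wide_guess)
```

If the text-only peek also shows NA in columns 2 to 6, the problem is unlikely to be type guessing alone.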

Thinking that read_excel() might simply be guessing the column types incorrectly, I then tried:

data1 <- read_excel( path=fpath, sheet=2, col_names=FALSE, 
                    col_types=c('text', 'text', 'numeric', 'numeric', 'numeric', 'numeric') )

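This eliminates the warning, since the columns are now typed correctly, but I still get NA values for every column except the first. This time str(data1) returns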

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   65535 obs. of  6 variables:
 $ X__1: chr  "43052.2" "43052.2" "43052.2" "43052.2" ...
 $ X__2: chr  NA NA NA NA ...
 $ X__3: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__4: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__5: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__6: num  NA NA NA NA NA NA NA NA NA NA ...
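A note on the first column: "43052.2" is consistent with the date-time from the first attempt expressed as an Excel serial number, i.e. days since 1899-12-30 in Excel's 1900 date system. The short check below works through that arithmetic. It assumes UTC for simplicity; the text value appears to be rounded, so only the date part, not the exact time of day, should be recovered from it.

```r
# Excel's 1900 date system counts days from 1899-12-30.
excel_origin <- as.POSIXct("1899-12-30 00:00:00", tz = "UTC")

# Timestamp taken from the first str() output (time zone assumed UTC).
ts <- as.POSIXct("2017-11-13 04:35:00", tz = "UTC")

# Forward direction: timestamp -> serial day number.
serial <- as.numeric(difftime(ts, excel_origin, units = "days"))
serial_rounded <- round(serial, 1)
print(serial_rounded)

# Reverse direction: whole-day part of the text value -> calendar date.
day_from_text <- as.Date(floor(as.numeric("43052.2")), origin = "1899-12-30")
print(day_from_text)
```

So reading that column as 'date' rather than 'text' in col_types would likely be the more convenient choice once the other columns read correctly.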

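Finally, I tried pasting the first 10 rows of data (formatting and all) from the second sheet of the Excel file into a new Excel workbook, saved it as test.xls, and then tried the following: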

fpath_test <- '/Users/bmosovsky/Downloads/test.xls'
data_test <- read_excel( path=fpath_test, sheet=1, col_names=FALSE,
                         col_types=c('text', 'text', 'numeric', 'numeric', 'numeric', 'numeric') )

Now str(data_test) returns the correct result:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   10 obs. of  6 variables:
 $ X__1: chr  "43052.2" "43052.2" "43052.2" "43052.2" ...
 $ X__2: chr  "CIN.MARKLND.3" "CIN.MIAMWAB.1" "CIN.MIAMWAB.2" "CIN.MIAMWAB.3" ...
 $ X__3: num  22.4 22.6 22.6 22.6 22.5 ...
 $ X__4: num  21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6
 $ X__5: num  0.8 1.02 1.02 1.02 0.92 0.93 1.29 1.29 1.29 0.06
 $ X__6: num  0.04 0.01 0.01 0.01 0.01 0.01 0.05 0.05 0.05 0.06

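So my question is: what is unique about the downloaded Excel file that prevents its data from being read into R correctly? I am trying to read this data as part of an automated data collection process, so manually manipulating the Excel file in any way is not an option as a workaround. Can anyone offer some insight into how to get the data from all sheets of the .xls file into R for processing?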
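For reference, the automated workflow described here could be sketched roughly as follows. fetch_xls and read_all_sheets are hypothetical helper names, not part of readxl; only download.file(), excel_sheets(), read_excel(), and readxl_example() are real functions. The sketch does not address the NA problem itself; it only shows the download-and-loop structure, demonstrated on an example workbook that ships with readxl.

```r
library(readxl)

# Hypothetical helper: download a workbook to a temporary file.
# mode = "wb" writes the file in binary mode so the .xls is not corrupted.
fetch_xls <- function(url, dest = tempfile(fileext = ".xls")) {
  download.file(url, destfile = dest, mode = "wb")
  dest
}

# Hypothetical helper: read every sheet into a named list of data frames.
# Extra arguments (col_names, col_types, ...) are passed to read_excel().
read_all_sheets <- function(path, ...) {
  sheets <- excel_sheets(path)
  out <- lapply(sheets, function(s) read_excel(path, sheet = s, ...))
  names(out) <- sheets
  out
}

# Demonstrated on a stand-in workbook shipped with readxl.
demo <- read_all_sheets(readxl_example("datasets.xls"))
print(names(demo))
```

With the file from the post, the call would look like read_all_sheets(fetch_xls(url), col_names = FALSE), where url holds the address given at the top.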

Here is the output of sessionInfo():

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2      rvest_0.3.2       xml2_1.1.1        RPostgreSQL_0.6-2 DBI_0.7-12        lubridate_1.6.0   dplyr_0.7.2       readxl_1.0.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12.3   tidyr_0.6.3      assertthat_0.2.0 cellranger_1.1.0 R6_2.2.2         magrittr_1.5     httr_1.2.1       rlang_0.1.1      stringi_1.1.5   
[10] curl_2.8.1       stringr_1.2.0    glue_1.1.1       compiler_3.4.1   pkgconfig_2.0.1  bindr_0.1        tibble_1.3.3 

0 answers