R: Reading text files with blank fields and varying numbers of columns

Date: 2017-11-08 16:01:00

Tags: r text multiple-columns missing-data read.table

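I am trying to read many text files into R using read.table. Most of the time we have clean text files with well-defined columns.

The data I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt

You can see that the whitespace and the length of the text file vary from report to report: ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt

My goal is to read many of these text files and combine them into one data set.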

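If I can read one of them in, then compiling them should not be a problem. However, I have run into several issues because of the format of the text file:

1) The number of FIRMS varies from report to report. For example, sometimes there are 3 rows of data to import (i.e. 3 firms that did business on that date), and sometimes there may be 10.

2) Blanks are not recognized. For example, under the FIRM section there should be a delivery (DEL) and a receipt (REC) column. The data read in for this section should look like this: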

# check.names = FALSE keeps the column name "FIRM_#" as written
df <- data.frame("FIRM_#" = c(407, 685, 800, 905), 
  "FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
  "DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA), check.names = FALSE)

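However, when I read this in, the formatting is all messed up and NA is not set for the blank values.

3) The issues above also apply to the "YARDS" and "FUTURE DELIVERIES SCHEDULED" sections of the text file.

I have tried reading the text file in sections and then formatting each one accordingly, but because the number of firms changes every day, the code does not generalize.

Any help is much appreciated.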

1 answer:

Answer 0: (score: 1)

Here is an answer that starts from scratch, using rvest to download the data, and includes a lot of formatting. The general idea is to identify fixed widths that can be used to separate the columns; I used a little help from SO for this purpose.
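To make the fixed-width idea concrete, here is a minimal sketch on made-up data (the header and rows below are built with sprintf() for illustration and are not copied from the real report). It uses gregexpr() to find the column starts, which is a simpler stand-in for the helper function in the code further down, and it assumes that every value sits underneath its header.

```r
# Made-up header and rows, padded with sprintf() so the columns line up
fmt    <- "%-8s%-24s%-6s%-3s"
header <- sprintf(fmt, "FIRM #", "FIRM NAME", "DEL", "REC")
rows   <- sprintf(fmt,
                  c("407", "685"),
                  c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC"),
                  c("1", "1"),
                  c("", "18"))

# Each run of two or more spaces in the header ends where the next column starts
gaps   <- gregexpr("\\s{2,}", header)[[1]]
starts <- c(1L, as.integer(gaps) + attr(gaps, "match.length"))
ends   <- c(starts[-1] - 1L, nchar(header))

# Cut every row at the same positions and trim the padding
fields <- t(sapply(rows, function(x) trimws(substring(x, starts, ends)),
                   USE.NAMES = FALSE))
colnames(fields) <- trimws(substring(header, starts, ends))

parsed <- as.data.frame(fields, stringsAsFactors = FALSE)
# Blank fields become NA
parsed[parsed == ""] <- NA
parsed
```

Because the cut positions come from the header line alone, the number of firm rows does not matter, which is the property the question asks for.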

You could then use read.fwf() in combination with cat() and tempfile(). In my first attempt this did not work because of some formatting issues, so I added a few extra lines to get the final table format.

Maybe there are more elegant options and shortcuts that I have overlooked, but my answer should at least get you started. Of course, you will have to adapt the selection of lines and the identification of widths for splitting the tables, depending on which parts of the data you need. Once this is settled, you can loop through all the report files to gather the data. I hope this helps...
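For reference, the read.fwf() route could look roughly like the sketch below. The rows, widths, and column names are made up for illustration; with the real report the widths would have to be derived from the header line first.

```r
# Made-up fixed-width rows (not taken from the real report)
rows <- sprintf("%-6s%-22s%4s%5s",
                c("407", "685", "800"),
                c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL"),
                c("1", "1", "15"),
                c("", "18", ""))

# read.fwf() reads from a file or connection, so write the lines to a temp file
tmp <- tempfile(fileext = ".txt")
cat(rows, file = tmp, sep = "\n")

# The widths match the sprintf() format above; blank fields are read as NA
fwf <- read.fwf(tmp, widths = c(6, 22, 4, 5),
                col.names = c("FIRM", "FIRM_NAME", "DEL", "REC"),
                strip.white = TRUE, na.strings = "",
                stringsAsFactors = FALSE)
unlink(tmp)
fwf
```

Here the numeric columns are converted automatically and the empty REC fields come back as NA, so no separate clean-up step is needed when the widths are known.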

library(rvest)
library(dplyr)

page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")

table <- page %>%
  #extract the raw text of the report
  html_text() %>%
  #reformat by splitting on line breaks
  { unlist(strsplit(., "\n")) } %>%
  #select range based on strings in specific lines
  "["(.,(grep("FIRM #", .):(grep("        DELIVERIES SCHEDULED", .)-1))) %>%
  #exclude empty rows
  "["(., !grepl("^\\s+$", .)) %>%
  #cut every line to the width of the header line (drops anything to the right of the table)
  { substring(., 1, nchar(gsub("\\s+$", "" , .[1]))) } %>%
  #strip white space on the left
  { gsub("^\\s+", "", .) }


headline <- unlist(strsplit(table[1], "\\s{2,}"))

get_split_position <- function(substring, string) {

   #number of characters in front of the first occurrence of substring
   nchar(string) - nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string, perl = TRUE))

}

#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {

   get_split_position(x, table[1])

})


#exclude headline from split
table <- lapply(table[-1], function(x) {

  substring(x,  c(1, split_positions + 1),  c(split_positions, nchar(x)))

})

table <- do.call(rbind, table)
colnames(table) <- headline

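#strip all whitespace; note that this also removes the spaces inside FIRM NAME
#(use gsub("^\\s+|\\s+$", "", table) instead to trim only the ends)
table <- gsub("\\s+", "", table)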

table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2,  as.numeric)

table
# FIRM #          FIRM NAME DEL REC
# 1    407      STRAITSFINLLC   1  NA
# 2    685   R.J.O'BRIENASSOC   1  18
# 3    800 ROSENTHALCOLLINSLL  15  NA
# 4    905 ADMINVESTORSERVICE   1  NA
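
Once a single report parses correctly, the loop over all reports mentioned above could be sketched as follows. parse_report() is a hypothetical placeholder that returns dummy rows so the example runs on its own; in practice it would wrap the parsing steps shown above. The date prefix of the file names is assumed to be MMDDYY.

```r
# File names from the question; the date prefix is assumed to be MMDDYY
files <- c("102317_livecattle.txt", "100917_livecattle.txt")

# Hypothetical placeholder for the parsing steps; returns dummy rows
parse_report <- function(file) {
  data.frame(FIRM = c(407L, 685L), DEL = c(1L, 1L), REC = c(NA, 18L))
}

# Parse every report, tag its rows with the report date, and stack the results
all_reports <- do.call(rbind, lapply(files, function(f) {
  out <- parse_report(f)
  out$report_date <- as.Date(substr(basename(f), 1, 6), format = "%m%d%y")
  out
}))
all_reports
```

Keeping the report date as its own column means the varying number of firms per day is simply a varying number of rows in the combined data set.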