Question

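I'm still new to R, so I apologize if I don't use the correct terminology. I'm interested in pulling a large amount of Unemployment Insurance Trust Fund data from the TreasuryDirect online report query system (http://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm), and I have successfully pulled the information using readLines.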

ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')

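This returns the report as a character vector, one string per line.

Is there a way to parse these lines and turn them into a data frame, so that I can at least tidy it up and easily pull the important information out of it? I'm sure there is another way to do this, but the reports always differ in which accounting code sections they include and in how many individual transactions they contain, so I don't even know where to start.

The items I need are the date, the shares/par (the dollar amount of the transaction), the transaction code, and the transaction description. Totals would be useful but are by no means necessary.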

When you view it in Excel, it looks like this: [screenshot omitted]

Answer 1

This will help you parse the information:

It works by first selecting all the lines that contain a given character (in the output below, every retained line contains a /). This even returns the column header. These lines are stored in d.

If you inspect d:

[1] "Effective Date Shares/Par Description Code Memo Number Code Account Number"
[2] "10/01/2015 2,313,000.0000 12-10 FUTA RECEIPTS 3305617 ESAA"
[3] "10/01/2015 3,663,000.0000 12-10 FUTA RECEIPTS 3305618 ESAA"
[4] "10/02/2015 4,314,000.0000 12-10 FUTA RECEIPTS 3305640 ESAA"
[5] "10/05/2015 3,512,000.0000 12-10 FUTA RECEIPTS 3305662 ESAA"

The information is neatly aligned, which means that the data in each column ends at an exact position. To parse it, you can use substr with start and stop positions, as shown in my script.

Of course, I haven't completed all of the parsing; I'll let you finish the rest. After parsing each column, create a data.frame(dates, sharesPar, ...).
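The script this answer refers to is not shown here. As a rough stand-in (not the author's original code), the sketch below applies the same idea: keep the lines containing a /, slice each column with substr, and assemble a data.frame. The sample lines are built with sprintf, and the column widths (12, 20, 2, 8, 16) are hypothetical values chosen for illustration; the real report's positions would have to be read off the actual file.

```r
# Hypothetical fixed-width sample. The widths below are made up for this
# sketch and are NOT the column positions of the Treasury report.
rows <- sprintf("%-12s%20s  %-8s%-16s",
                c("10/01/2015", "10/02/2015", "10/05/2015"),
                c("2,313,000.0000", "4,314,000.0000", "3,512,000.0000"),
                "12-10",
                "FUTA RECEIPTS")
raw <- c("REPORT HEADER LINE", rows, "PERIOD TOTAL")

# Keep only the lines that contain a "/" (here: the lines with a date).
d <- grep("/", raw, fixed = TRUE, value = TRUE)

# Slice each column by its start/stop position, then trim and convert.
dates     <- as.Date(trimws(substr(d, 1, 12)), format = "%m/%d/%Y")
sharesPar <- as.numeric(gsub(",", "", trimws(substr(d, 13, 32))))
code      <- trimws(substr(d, 35, 42))
descr     <- trimws(substr(d, 43, 58))

result <- data.frame(dates, sharesPar, code, descr, stringsAsFactors = FALSE)
```

Because substr is vectorized over d, each column is extracted for all lines at once, so no explicit loop is needed.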

Answer 2

It's a fixed-width format, so it should be treated as one:

library(dplyr)   # provides %>% and mutate_all()
library(readr)   # provides type_convert()

readLines("http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt") %>% 
  grep("^ +[[:digit:]]+/[[:digit:]]+", ., value=TRUE) %>%  # grab only the lines with data
  textConnection() %>% 
  read.fwf(widths=c(19, 26, 27, 15, 10, 27), skip=7) %>%   # read the columns; note that skip=7 also drops the first 7 matched lines
  mutate_all(trimws) %>%                                   # clean them up
  type_convert() %>%                                       # you still need to convert the date even with this type conversion
  setNames(c("effective_date", "shares_par",               # add decent colnames
             "trans_descr_code", "memo_num", "location_code", "acct_no"))

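The comment in the pipeline above notes that the date still has to be converted afterwards. A minimal base-R sketch of that last step follows, assuming a small hypothetical fixed-width sample; the widths 12/20/10 and the column names are illustrative and are not those of the Treasury file.

```r
# Hypothetical fixed-width lines written to a temporary file; the widths
# 12/20/10 are assumptions for this sketch only.
sample_lines <- sprintf("%-12s%20s%10s",
                        c("10/01/2015", "10/02/2015"),
                        c("2,313,000.0000", "4,314,000.0000"),
                        "12-10")
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Read every column as character so nothing is guessed prematurely.
df <- read.fwf(tmp, widths = c(12, 20, 10),
               col.names = c("effective_date", "shares_par", "trans_code"),
               colClasses = "character")
unlink(tmp)

# Trim the padding, then convert the date and the amount explicitly.
df[] <- lapply(df, trimws)
df$effective_date <- as.Date(df$effective_date, format = "%m/%d/%Y")
df$shares_par     <- as.numeric(gsub(",", "", df$shares_par))
```

Reading as character first and converting afterwards keeps the thousands separators and the month/day/year format under explicit control.
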
Reading and parsing an irregular, mixed ASCII file into R
