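I'm working on an automated process to extract tables from annual PDF reports. Ideally, I'd take each year's report, extract the data from a table in it, combine all the years into one large data frame, and then analyze it. Here is what I have so far (focusing on just one year's report):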
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))
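My problem is turning this raw table data into a data frame with one row per state and one column per value. I'm sure the solution is simple, but I'm fairly new to R. Any help?
Edit: All of these solutions are great and work well. However, when I tried a report from another year, I got an error: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
This happened after running the following code on the next report: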
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
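A possible explanation for the error above, offered as an assumption since the 2010 PDF was not inspected here: as far as I know, readr treats a single string as literal data only if it contains a newline, and otherwise looks it up as a file path. If the grep filter keeps just one line of the page, the collapsed string has no newline, and a "does not exist in current working directory" message would follow. The base R sketch below uses a hypothetical helper, count_state_lines, and made-up page text to check how many lines the filter keeps before any parsing is attempted.

```r
# Hypothetical helper: report how many lines of a page survive the filter
# and whether the collapsed text still contains a newline.
count_state_lines <- function(page_text, pattern = "^\\s{2}\\w") {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  kept <- grep(pattern, lines, value = TRUE)
  collapsed <- paste(kept, collapse = "\n")
  list(
    n_kept = length(kept),
    has_newline = grepl("\n", collapsed, fixed = TRUE)
  )
}

# Made-up page text standing in for one element of pdf_text() output.
# In good_page the rows start with exactly two spaces; in bad_page the
# table row is indented by four spaces, so only the stray "  0" line matches.
good_page <- "Table title\n  Alpha*     $1,234     $56\n  Beta          789   1,011\n"
bad_page  <- "Table title\n    Alpha*   $1,234     $56\n  0\n"

ok  <- count_state_lines(good_page)
bad <- count_state_lines(bad_page)
print(ok)
print(bad)
```

If n_kept comes back as 0 or 1 for the real page, the page number or the indentation pattern probably needs adjusting for that year's layout.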
Answer 0 (score: 2)
readr::read_fwf has a fwf_empty utility that guesses the column widths for you, which makes the job much simpler:
library(tidyverse)
df <- txt14_16[56] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
df
#> # A tibble: 50 × 13
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612
#> 2 Maine 746 1767 267 2780 753 1510 270 2533 776
#> 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411
#> 4 New Hampshire 491 660 175 1326 515 936 166 1617 523
#> 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953
#> 6 Vermont 282 797 332 1411 302 923 326 1551 337
#> 7 Delaware 662 1001 0 1663 668 1193 14 1875 689
#> 8 Maryland 2893 4807 860 8560 2896 5686 1061 9643 2812
#> 9 New Jersey 3961 6920 1043 11924 3831 8899 1053 13783 3955
#> 10 New York 10981 24237 4754 39972 11161 29393 5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> # X13 <dbl>
Obviously, column names still need to be added, but the data is fairly usable at this point.
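As a rough sketch of the column-naming step: in the rows shown, the second through fourth values sum to the fifth (for example 3779 + 2992 + 0 = 6771), which suggests three components plus a total for each of three periods. The labels below (period1, part_a and so on) are placeholders I made up, not the report's real headers, so replace them with the headings from the PDF. The two rows are copied from the output shown on this page.

```r
# Two rows copied from the output on this page (state plus 12 values)
df <- data.frame(
  X1 = c("Connecticut", "Maine"),
  matrix(c(3779, 2992,   0, 6771, 3496, 3483,   0, 6979, 3612, 3604,   0, 7216,
            746, 1767, 267, 2780,  753, 1510, 270, 2533,  776, 1605, 274, 2655),
         nrow = 2, byrow = TRUE),
  stringsAsFactors = FALSE
)

# Placeholder labels (assumptions, not taken from the report)
periods <- c("period1", "period2", "period3")
parts   <- c("part_a", "part_b", "part_c", "total")

# One name for the state column, then 3 periods x 4 parts = 12 value names
names(df) <- c("state",
               paste(rep(periods, each = 4), rep(parts, times = 3), sep = "_"))

# Sanity check: components should add up to the total in each row
with(df, period1_part_a + period1_part_b + period1_part_c == period1_total)
```

The same names(df) <- ... assignment works on the tibble returned by read_fwf.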
Answer 1 (score: 0)
Your gsub calls are a bit too aggressive. Everything up through data[-c(1,...)] is fine, so I'll pick up from there and replace all of the gsub calls:
# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* $3,779 $2,992 $0 $6,771 $3,496 $3,483 $0 $6,979 $3,612 $3,604 $0 $7,216
# 2 Maine* 746 1,767 267 2,780 753 1,510 270 2,533 776 1,605 274 2,655
# 3 Massachusetts 6,359 5,542 143 12,044 6,953 6,771 174 13,898 7,411 7,463 292 15,166
# 4 New Hampshire 491 660 175 1,326 515 936 166 1,617 523 1,197 238 1,958
# 5 Rhode Island 998 1,190 31 2,219 998 1,435 24 2,457 953 1,527 22 2,502
# 6 Vermont* 282 797 332 1,411 302 923 326 1,551 337 948 338 1,623
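Caveat: the widths I used (35,15,20,...) were derived in haste. I think they work, but I admit I did not check line by line to confirm that nothing got cut off. Please verify!
Finally, from here you will probably want to remove the $ and , characters and convert the values to integers, which is fairly straightforward: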
dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216
# 2 Maine* 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502
# 6 Vermont* 282 797 332 1411 302 923 326 1551 337 948 338 1623
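One thing to watch with the "[^0-9]" pattern: it also strips minus signs and decimal points. None appear in the rows shown, so this only matters if another year's table contains negative or fractional amounts, which is an assumption on my part. A hedged sketch of a more conservative cleaner follows; clean_amount is a hypothetical helper, and treating parentheses as negatives is a guess at accounting notation, not something seen in this report.

```r
# Hypothetical helper: convert strings like "$3,779" to numbers while
# keeping decimals and treating "-x" or "(x)" as negative.
clean_amount <- function(x) {
  x <- trimws(x)
  negative <- grepl("^\\(.*\\)$", x) | grepl("^-", x)
  digits <- gsub("[^0-9.]", "", x)   # keep digits and the decimal point
  value <- as.numeric(digits)
  ifelse(negative, -value, value)
}

# The first two inputs are in the format shown above; the last two are made up
amounts <- clean_amount(c("$3,779", "1,767", "(12.5)", "-40"))
print(amounts)
```

To use it in place of the gsub call, something like dat3[-1] <- lapply(dat3[-1], clean_amount) should work, at the cost of numeric rather than integer columns.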
I'm guessing the asterisks in the state names are meaningful. That can easily be captured with grepl and then removed:
dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 ast
# 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216 TRUE
# 2 Maine 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655 TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958 FALSE
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502 FALSE
# 6 Vermont 282 797 332 1411 302 923 326 1551 337 948 338 1623 TRUE
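To get from one cleaned table to the multi-year data frame the question asks for, one option is to tag each year's table with a label and stack them. This is a minimal base R sketch under the assumption that every year's table ends up with the same columns, which may not hold if the report layout changes. tag_report, the report labels, and the tiny tables are all made-up placeholders.

```r
# Hypothetical helper: prepend a column identifying which report a table came from
tag_report <- function(df, report) {
  data.frame(report = report, df, stringsAsFactors = FALSE)
}

# Placeholder tables standing in for two years of parsed output (made-up values)
tables <- list(
  report_a = data.frame(V1 = c("State A", "State B"), V2 = c(1L, 2L),
                        stringsAsFactors = FALSE),
  report_b = data.frame(V1 = c("State A", "State B"), V2 = c(3L, 4L),
                        stringsAsFactors = FALSE)
)

# Tag each table with its list name, then stack the rows
tagged <- Map(tag_report, tables, names(tables))
combined <- do.call(rbind, unname(tagged))
rownames(combined) <- NULL
print(combined)
```

rbind will stop with an error if the column names differ between years, which is a useful early warning that one report needs its own widths or page number.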