Extracting PDF data with a varying number of spaces as the delimiter

Date: 2019-07-17 09:22:55

Tags: r pdf pdf-scraping

I'm looking for a way to extract the data from this PDF.

I've run into a problem where location names made up of several words (e.g. "Northern Island") end up spread across different columns.

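The "sep" argument of "read.table" only seems to accept a single space as the delimiter. Ideally, I'd like any run of more than one space to be treated as a separator. Is that possible?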


library(pdftools)  # provides pdf_text()

url <- "C:/Users/files/PSSS Weekly Bulletin - W1 2019 (Dec 31-Jan 06).pdf"

# Convert the PDF to a text string
txt <- pdf_text(url)

# get the working directory
wd <- getwd()

#write the file to the working directory
file_name <- paste0(wd, "/", "temp.txt")
write(txt, file = file_name, sep = "\t")

# Convert to a table. Data is located starting line 25, and lasts 25 lines
# P.S: I've tried this code with and without the "sep" argument. No change. 
dtaPCF <- read.table(file_name, skip = 24, nrows = 25, fill = TRUE, header = TRUE)

# Here is the text that I'd like to read.table with. Ideally, I'd want to keep the headers, but it's not a dealbreaker if that doesn't work.


Country/Area      No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI

American Samoa   0          0             0%          0    0          0    0   0

Cook Islands     13         11            85%         0    3          3    0   0

FSM              4          3             75%         0    21         74   0   3

Fiji             0          0             0%          0    0          0    0   0

French Polynesia 31         16            52%         3    9          11   3   3

Guam             0          0             0%          0    0          0    0   0

Kiribati         7          7             100%        0    172        609  22  0

Marshall Islands 2          2             100%        0    4          0    2   0

N Mariana Is     7          7             100%        4    13         60   17  0

Nauru            0          0             0%          0    0          0    0   0

New Caledonia    0          0             0%          0    0          0    0   0

New Zealand      0          0             0%          0    0          0    0   0

Niue             0          0             0%          0    0          0    0   0

PNG              0          0             0%          0    0          0    0   0

Palau            0          0             0%          0    0          0    0   0

Pitcairn Islands 1          1             100%        0    0          0    0   0

Samoa            13         6             46%         0    262        606  18  4

Solomon Islands  13         4             31%         0    75         59   4   1

Tokelau          2          2             100%        0    2          9    0   0

Tonga            11         11            100%        0    17         73   0   0

Tuvalu           0          0             0%          0    0          0    0   0

Vanuatu          11         7             64%         0    49         171  0   1

Wallis & Futuna  0          0             0%          0    0          0    0   0

1 Answer:

Answer 0 (score: 0)

This is the code I ended up using. I used Notepad to check the maximum character length of each column and passed those lengths to fwf_widths().

library(readr)

# Read the table as fixed-width text; widths are the per-column character counts
dtaPCF <- read_fwf(file_name,
                   col_positions = fwf_widths(
                     c(17, 11, 14, 12, 5, 11, 5, 4, 1),
                     c("Country/Area", "No. sites", "No. reported",
                       "% reported", "AFR", "Diarrhoea", "ILI", "PF", "DLI")),
                   skip = 47,
                   n_max = 23,
                   trim_ws = TRUE)
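
If counting column widths by hand is inconvenient, a regex-based split in base R might also work. This is only a sketch, and it is a different technique from the fixed-width read above. It assumes that the header fields are separated by at least two spaces and that the first value after each location name starts with a digit. The second assumption matters because some rows in the sample (e.g. "French Polynesia 31") have only one space between the name and the first number, so splitting every row on two or more spaces alone would not be enough. The names txt_lines, data_rows and count_cols are made up for this illustration, and the sample lines are copied from the question.

```r
# A few lines copied from the question, standing in for the real text file
txt_lines <- c(
  "Country/Area      No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI",
  "",
  "American Samoa   0          0             0%          0    0          0    0   0",
  "",
  "Cook Islands     13         11            85%         0    3          3    0   0",
  "",
  "French Polynesia 31         16            52%         3    9          11   3   3",
  "",
  "Kiribati         7          7             100%        0    172        609  22  0"
)

# Drop blank lines and surrounding whitespace
rows <- trimws(txt_lines[nzchar(trimws(txt_lines))])

# Header: split on runs of two or more whitespace characters
header <- strsplit(rows[1], "\\s{2,}")[[1]]

# Data rows: the name is everything before the first whitespace
# that is followed by a digit (assumption: no location name contains
# a word starting with a digit)
data_rows <- rows[-1]
parts <- regmatches(data_rows,
                    regexec("^(.*?)\\s+(\\d.*)$", data_rows, perl = TRUE))
country <- vapply(parts, `[`, character(1), 2)
rest    <- vapply(parts, `[`, character(1), 3)

# The remaining fields contain no spaces, so any whitespace separates them
values <- do.call(rbind, strsplit(rest, "\\s+"))

dtaPCF <- data.frame(country, values, stringsAsFactors = FALSE)
names(dtaPCF) <- header

# Convert the count columns to integers; "% reported" stays as text
count_cols <- setdiff(header, c("Country/Area", "% reported"))
dtaPCF[count_cols] <- lapply(dtaPCF[count_cols], as.integer)
```

With the real file, txt_lines would presumably come from readLines(file_name), restricted to the lines that hold the table. A row that does not match the pattern would make the vapply() calls fail, so it is worth checking the selected lines first.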