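I'm looking for a way to extract data from this PDF.
The problem I'm running into is that location names made up of several words (e.g. "Northern Island") get split across different columns.
The "sep" argument of "read.table" seems to treat only a single space as the delimiter. Ideally, I'd like any run of multiple spaces to act as the separator. Is that possible?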
library(pdftools)  # provides pdf_text()

url <- "C:/Users/files/PSSS Weekly Bulletin - W1 2019 (Dec 31-Jan 06).pdf"
# Convert the PDF to a character vector (one string per page)
txt <- pdf_text(url)
# Get the working directory
wd <- getwd()
# Write the text to a file in the working directory
file_name <- paste0(wd, "/", "temp.txt")
write(txt, file = file_name, sep = "\t")
# Convert to a table. The data starts at line 25 and spans 25 lines.
# P.S.: I've tried this code with and without the "sep" argument. No change.
dtaPCF <- read.table(file_name, skip = 24, nrows = 25, fill = TRUE, header = TRUE)
# Here is the text that I'd like to read.table with. Ideally, I'd want to keep the headers, but it's not a dealbreaker if that doesn't work.
Country/Area     No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI
American Samoa   0          0             0%          0    0          0    0   0
Cook Islands     13         11            85%         0    3          3    0   0
FSM              4          3             75%         0    21         74   0   3
Fiji             0          0             0%          0    0          0    0   0
French Polynesia 31         16            52%         3    9          11   3   3
Guam             0          0             0%          0    0          0    0   0
Kiribati         7          7             100%        0    172        609  22  0
Marshall Islands 2          2             100%        0    4          0    2   0
N Mariana Is     7          7             100%        4    13         60   17  0
Nauru            0          0             0%          0    0          0    0   0
New Caledonia    0          0             0%          0    0          0    0   0
New Zealand      0          0             0%          0    0          0    0   0
Niue             0          0             0%          0    0          0    0   0
PNG              0          0             0%          0    0          0    0   0
Palau            0          0             0%          0    0          0    0   0
Pitcairn Islands 1          1             100%        0    0          0    0   0
Samoa            13         6             46%         0    262        606  18  4
Solomon Islands  13         4             31%         0    75         59   4   1
Tokelau          2          2             100%        0    2          9    0   0
Tonga            11         11            100%        0    17         73   0   0
Tuvalu           0          0             0%          0    0          0    0   0
Vanuatu          11         7             64%         0    49         171  0   1
Wallis & Futuna  0          0             0%          0    0          0    0   0
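Answer 0 (score: 0)
This is the code I ended up using. I opened the text file in Notepad to check the maximum character length of each column, then passed those lengths to fwf_widths().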
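library(readr)

# Column widths were measured by hand from the extracted text file
dtaPCF <- read_fwf(
  file_name,
  col_positions = fwf_widths(
    c(17, 11, 14, 12, 5, 11, 5, 4, 1),
    c("Country/Area", "No. sites", "No. reported",
      "% reported", "AFR", "Diarrhoea", "ILI", "PF", "DLI")
  ),
  skip = 47,      # lines to skip before the first row that is read
  n_max = 23,     # maximum number of rows to read
  trim_ws = TRUE  # strip the padding around each field
)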