我正在尝试从以下链接下载的pdf中抓取数据,并将其存储为数据表进行分析。 https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf。
这是我到目前为止所拥有的;
require(pdftools)
require(data.table)
require(stringr)
url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")
该代码几乎可以正常工作,但是在注释列中,由于\ n,文本溢出到新页面上,我最终代码溢出到了新行。例如,在84年1月19日,“注释”列应显示为;
Corporate Event - Acquisition of Eagle Star by BAT Industries
但是使用我的代码,“ BAT工业”会溢出到新行中,而我希望它与上面的行在同一字符串中。
运行代码后,我希望拥有与pdf相同的表,并且所有文本都输入正确的列中。
谢谢。
答案 0 :(得分:1)
我们可以使用以下操作。
true
false
true
false
我猜想最终您将需要一个数据框而不是它们的列表。为此,您可以使用dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
# Split vectors into 4 columns (sometimes we may have 5 due to the issue that
# you mentioned, so str_split_fixed becomes useful) by possibly \n and
# at least two spaces.
df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
# Replace any remaining (in the last columns) cases of possibly \n and
# at least two spaces.
df <- gsub("(\n)*[ ]{2,}", " ", df)
colnames(df) <- c("Date", "Added", "Deleted", "Notes")
df[df == ""] <- NA
data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
。