Question

I am trying to scrape data from the PDF at the link below and store it as a data table for analysis: https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf

This is what I have so far:

require(pdftools)
require(data.table)
require(stringr)

url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"

dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")

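The code almost works, but in the Notes column, text that wraps onto a second line in the PDF ends up on a new line in my output because of the \n. For example, for 19-Jan-84 the Notes column should read: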

Corporate Event - Acquisition of Eagle Star by BAT Industries

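With my code, however, "BAT Industries" spills onto a new line, whereas I want it in the same string as the line above.

After running the code, I would like to have the same table as in the PDF, with all text in the correct columns.

Thanks.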

Answer 1

We can use the following manipulations.

dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
  # Split vectors into 4 columns (sometimes we may have 5 due to the issue that
  # you mentioned, so str_split_fixed becomes useful) by possibly \n and
  # at least two spaces.
  df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
  # Replace any remaining (in the last columns) cases of possibly \n and
  # at least two spaces.
  df <- gsub("(\n)*[ ]{2,}", " ", df)
  colnames(df) <- c("Date", "Added", "Deleted", "Notes")
  df[df == ""] <- NA
  data.frame(df[-1, ])
})

head(dfl[[1]])
#   Date      Added                     Deleted                    Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star                 Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho                    Magnet & Southerns         <NA>
# 3 02-Jul-84 Reuters                   Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths                Barratt Development        <NA>
# 5 19-Jul-84 Enterprise Oil            Bowater Corporation        Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber              Wimpey (George) & Co       <NA>

I guess that ultimately you will want a single data frame rather than a list of them. Since every page yields the same four columns, the per-page data frames can be bound together row-wise.
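One way to combine the pages is sketched below. It is only an illustration: `page1`, `page2`, `dfl_toy` and `constituents` are made-up names, and the rows are a small hand-typed stand-in for the real per-page data frames, so the block runs without downloading the PDF. It uses only base R.

```r
# Toy stand-ins for two per-page data frames (hypothetical subset of rows,
# not the full PDF contents).
page1 <- data.frame(
  Date    = c("19-Jan-84", "02-Apr-84"),
  Added   = c("Charterhouse J Rothschild", "Lonrho"),
  Deleted = c("Eagle Star", "Magnet & Southerns"),
  Notes   = c("Corporate Event - Acquisition of Eagle Star by BAT Industries", NA),
  stringsAsFactors = FALSE
)
page2 <- data.frame(
  Date    = "02-Jul-84",
  Added   = "Reuters",
  Deleted = "Edinburgh Investment Trust",
  Notes   = NA_character_,
  stringsAsFactors = FALSE
)
dfl_toy <- list(page1, page2)

# Stack the per-page data frames row-wise into a single data frame.
# This relies on every element having the same column names.
constituents <- do.call(rbind, dfl_toy)

print(constituents)
```

With the real list, the same call would be `do.call(rbind, dfl)`. Since the question already loads data.table, `rbindlist(dfl)` should be the equivalent if a data.table is preferred.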
