Question

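Hello, I am trying to convert multiple PDFs to text. My code runs, but most of my files are in Spanish and contain characters such as ñ, í, ó, ú and é, which come out corrupted. I also need the text files to be lowercase so that I can run text analysis on them later: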

library(XML)
library(httr)
library(dplyr)
library(tidyr)
library(stringr)
library(tm)

# Get the full paths of all the downloaded PDFs
pdf_files <- list.files(path = file.path(getwd(), 'pdf'),
                        pattern = '\\.pdf$',  # regex: only names ending in .pdf
                        ignore.case = TRUE,
                        full.names = TRUE)

# Path to the xpdf converter bundled with the project
pdftotext <- file.path(getwd(), 'dependencies/xpdf/bin64/pdftotext.exe')

# Check there are pdf files in directory
if (length(pdf_files) > 0) {

  # Loop through each PDF and create a txt version in the same folder
  for (pdf in pdf_files) {
    system(
      paste(paste0('"', pdftotext, '"'), paste0('"', pdf, '"')),
      wait = TRUE)  # wait, so the message below prints only after conversion
  }
}

cat('\nConversion to text complete.\n\n')
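For reference, here is a minimal sketch of the two steps I am after, not a confirmed solution. It assumes that xpdf's `pdftotext` is given `-enc UTF-8` (its default output encoding is, as far as I know, Latin1, which may be why the accents break) and that the resulting `.txt` files are then read as UTF-8. The helper names `build_pdftotext_cmd`, `lowercase_txt` and `lowercase_dir` are hypothetical and not part of my script.

```r
# Hypothetical helper: build the pdftotext command line with UTF-8 output.
# `-enc UTF-8` is the xpdf option that selects the output text encoding.
build_pdftotext_cmd <- function(exe, pdf) {
  paste(shQuote(exe, type = "cmd"), "-enc UTF-8", shQuote(pdf, type = "cmd"))
}

# Hypothetical helper: lowercase one text file in place.
# Assumes the file is stored in `encoding` (UTF-8 by default).
lowercase_txt <- function(path, encoding = "UTF-8") {
  # Mark the strings with their encoding so accented letters are not misread
  txt <- readLines(path, encoding = encoding, warn = FALSE)
  # tolower() should also fold accented capitals, though this can depend
  # on the platform and locale
  txt <- tolower(txt)
  # Write the result back as UTF-8 bytes, without re-encoding to the locale
  writeLines(enc2utf8(txt), path, useBytes = TRUE)
  invisible(txt)
}

# Hypothetical helper: lowercase every .txt file in a directory.
lowercase_dir <- function(dir, encoding = "UTF-8") {
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  for (f in txt_files) lowercase_txt(f, encoding)
  invisible(txt_files)
}
```

In my loop this would mean calling `system(build_pdftotext_cmd(exe, pdf), wait = TRUE)` for each PDF, where `exe` is the path to `pdftotext.exe`, and then `lowercase_dir(file.path(getwd(), 'pdf'))` once all conversions have finished.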

Convert PDF (with special characters) to text

0 answers: