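Hello, I am trying to convert multiple PDFs to text. My code runs, but most of my files are in Spanish and contain characters such as ñ, í, ó, ú and é, which are getting corrupted. I also need the text files to be lowercase so that I can run text analysis on them later: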
library(XML)
library(httr)
library(dplyr)
library(tidyr)
library(stringr)
library(tm)
# Get a list of all of the document names of the downloaded PDFs
pdf_files <- list.files(path = file.path(getwd(), 'pdf'),
                        pattern = '\\.pdf$',  # match the .pdf extension only
                        ignore.case = TRUE,
                        full.names = TRUE)
# Check there are pdf files in directory
if (length(pdf_files) > 0) {
  # Loop through each PDF and create a txt version in the same folder
  for (i in pdf_files) {
    system(
      paste(
        paste0('"', getwd(), '/dependencies/xpdf/bin64/pdftotext.exe"'),
        paste0('"', i, '"')),
      wait = TRUE)  # wait for each conversion so the message below is accurate
  }
}
cat('\nConversion to text complete.\n\n')
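One possible direction, offered as a sketch rather than a confirmed fix: xpdf's pdftotext writes Latin1 by default, and its `-enc UTF-8` option asks for UTF-8 output instead, after which the file can be read as UTF-8 and lowercased in R. The helper names below (`pdftotext_args`, `lowercase_utf8_file`, `convert_pdf`) are hypothetical and not part of the original script, and the sketch assumes pdftotext writes the `.txt` file next to the PDF.

```r
# Hypothetical helper: arguments for pdftotext.
# '-enc UTF-8' requests UTF-8 output instead of the default Latin1.
pdftotext_args <- function(pdf_path) {
  c("-enc", "UTF-8", shQuote(pdf_path))
}

# Hypothetical helper: read a UTF-8 text file, lowercase it, write it back.
# useBytes = TRUE writes the UTF-8 bytes as they are, without re-encoding
# them to the session's native encoding.
# Note: how tolower() treats accented capitals can depend on platform and
# locale; stringr::str_to_lower() is an alternative if that is a problem.
lowercase_utf8_file <- function(txt_path) {
  lines <- readLines(txt_path, encoding = "UTF-8", warn = FALSE)
  writeLines(tolower(lines), txt_path, useBytes = TRUE)
  invisible(txt_path)
}

# Hypothetical helper: convert one PDF, then lowercase the resulting text.
# system2() waits for the command to finish by default.
convert_pdf <- function(pdf_path, exe) {
  status <- system2(exe, args = pdftotext_args(pdf_path))
  txt_path <- sub("\\.pdf$", ".txt", pdf_path, ignore.case = TRUE)
  if (status == 0 && file.exists(txt_path)) {
    lowercase_utf8_file(txt_path)
  }
  invisible(status)
}
```

Inside the existing loop, this could be called as `convert_pdf(i, file.path(getwd(), 'dependencies/xpdf/bin64/pdftotext.exe'))` in place of the `system()` call.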