I am trying to read all the text files in a folder, and this is what I am doing.
My code is:
library(dplyr); library(readr); library(rvest); library(data.table);
# List all the text files in the folder
files = list.files(pattern="*.txt")
# read from file and append to rows
tbl = lapply(files, read_html %>% html_nodes("text") %>% html_text() ) %>% bind_rows()
This gives me an error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "function"
Can someone help me figure out where I went wrong?
Answer 0 (score: 2)
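The core of the problem is that read_html %>% html_nodes("text") %>% html_text() does not evaluate to a function. The pipe passes the function read_html itself as the first argument of html_nodes(), which is why the error complains about an object of class "function". You can use a magrittr lambda by starting the pipe with a dot, e.g. . %>% read_html %>% html_nodes("text") %>% html_text().
Then, the final html_text() gives you a character vector, not a data frame that you can feed to bind_rows.
Instead of lapply / bind_rows you can use purrr::map_df():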
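library(purrr)
library(rvest)
library(tibble)  # provides tibble()

# For each file, extract the text of the <text> nodes into a tibble;
# map_df() row-binds the per-file tibbles into one data frame
map_df( files, ~ {
  file <- .x
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text()
  tibble( file, MyText )
} )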
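For completeness, here is a minimal sketch of the dot-started lambda mentioned above, combined with a base R way of getting from character vectors to a data frame. The two sample files, their contents, and the names extract_text, tmp, and tbl are assumptions made up for illustration; they are not from the question.

```r
library(rvest)  # also makes %>% available

# Hypothetical sample data: two small files in a temporary folder
tmp <- tempfile("txt_demo")
dir.create(tmp)
writeLines("<doc><text>first article</text></doc>",  file.path(tmp, "a.txt"))
writeLines("<doc><text>second article</text></doc>", file.path(tmp, "b.txt"))
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Starting the pipe with a dot creates a function of one argument
extract_text <- . %>% read_html() %>% html_nodes("text") %>% html_text()

# lapply() now receives a real function; each call returns a character vector,
# so wrap it in a data frame before row-binding
tbl <- do.call(rbind, lapply(files, function(f) {
  txt <- extract_text(f)
  data.frame(file = rep(basename(f), length(txt)),
             MyText = txt,
             stringsAsFactors = FALSE)
}))
print(tbl)
```

Repeating the file name with rep() keeps the columns the same length even when a file contains several <text> nodes or none at all.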
Answer 1 (score: 0)
Here is my solution. I have checked it on my laptop and it works:
# ________________ THE STEPS BELOW READ THE DATA SETS AND CREATE A DATA FRAME ____________________
library(rvest)  # read_html(), html_nodes(), html_text() and %>%
# set the default folder first
setwd("drive/you/folder/location")
# list the text files in the folder
files <- list.files()
# create an empty data frame
data <- data.frame()
# read the files one by one and build the data frame
for (f in files) {
  # read as HTML
  dat <- read_html(f)
  # extract everything within the <TEXT> and </TEXT> tags
  dat2 <- data.frame(Text = dat %>% html_nodes("text") %>% html_text(), stringsAsFactors = FALSE)
  # split the first text entry on " | " into its fields
  dat3 <- data.frame(Text = strsplit(dat2$Text, " \\| ")[[1]], stringsAsFactors = FALSE)
  # split the third field on ":" into header and body
  dat4 <- data.frame(Text = strsplit(dat3$Text[[3]], ":")[[1]], stringsAsFactors = FALSE)
  # combine the fields into one row after some basic text cleaning
  NewsData <- data.frame(News_Paper = trimws(dat3$Text[1], which = "both"),
                         News_Class = trimws(dat3$Text[2], which = "both"),
                         Author_Location_Date = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[1], which = "both")),
                         Text = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[2], which = "both"))
  )
  # append the row for this file to the rows collected so far
  data <- rbind.data.frame(data, NewsData, make.row.names = FALSE, stringsAsFactors = FALSE)
}
# remove the intermediate data frames
rm(list = c("dat2", "dat3", "dat4"))
# ________________ END OF THE ABOVE STEPS ___________________________________________________
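To make the two strsplit() steps in the loop easier to follow, here is a small base R sketch run on a single made-up record. The sample string and the "paper | class | author, location, date: body" layout are assumptions inferred from the code above, not taken from real data.

```r
# Hypothetical record in the layout the loop appears to expect
raw <- "The Daily Example | Sports | Jane Doe, London, 1 Jan 2020: Team wins\n the match"

# Step 1: split on " | " into paper, class and the remaining text
fields <- strsplit(raw, " \\| ")[[1]]

# Step 2: split the third field on ":" into header and body
pieces <- strsplit(fields[3], ":")[[1]]

# Step 3: trim the ends and replace line breaks, tabs and whitespace runs with a space
clean <- function(x) gsub("\r?\n|\r|\t|\\s+", " ", trimws(x, which = "both"))

row <- data.frame(News_Paper = trimws(fields[1]),
                  News_Class = trimws(fields[2]),
                  Author_Location_Date = clean(pieces[1]),
                  Text = clean(pieces[2]),
                  stringsAsFactors = FALSE)
print(row)
```

Note that only the second piece is kept as Text, so a body that itself contains a colon would be cut off at that colon.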
I hope this helps.