Question

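I am trying to read all the text files in a folder. For each file, I want to:

Read the content of a specific HTML tag, "TEXT"
Store it in a data frame under a column named "MyText"
Append the rows from the next text file (processed as above)

My code is: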

library(dplyr); library(readr); library(rvest); library(data.table); 

# List all the text files in the folder
files = list.files(pattern="*.txt")

# read from file and append to rows
tbl = lapply(files, read_html %>% html_nodes("text") %>%  html_text() ) %>% bind_rows()

This gives me an error:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "function"

Can someone help me figure out where I went wrong?

Answer 1

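The core of the problem is that read_html %>% html_nodes("text") %>% html_text() does not evaluate to a function: the pipe passes the function read_html itself to html_nodes(), which is why the error mentions an object of class "function". You can use a magrittr lambda by starting the pipe with a dot, e.g. . %>% read_html %>% html_nodes("text") %>% html_text().

Even then, html_text() ultimately gives you a character vector, not a data frame that you can feed to bind_rows.

Instead of lapply followed by bind_rows, you can use purrr::map_df():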
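As a minimal sketch of the dot-first lambda, assuming two throwaway files whose names and contents are made up purely for illustration, the following shows that the pipe becomes a real function and that each call returns a character vector:

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# Hypothetical sample files, created in a temporary folder for illustration only
dir <- tempfile("texts")
dir.create(dir)
writeLines("<DOC><TEXT>first document</TEXT></DOC>",  file.path(dir, "a.txt"))
writeLines("<DOC><TEXT>second document</TEXT></DOC>", file.path(dir, "b.txt"))

# `pattern` is a regular expression, so "\\.txt$" is used rather than the glob "*.txt"
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# Starting the pipe with a dot builds a one-argument function
extract_text <- . %>% read_html() %>% html_nodes("text") %>% html_text()

# lapply() now receives a function; the result is a list of character vectors
texts <- lapply(files, extract_text)
str(texts)
```

Because each element is a plain character vector, it still has to be wrapped in a data frame before the rows can be bound together.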

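library(purrr)
library(rvest)
library(tibble)  # for tibble(); dplyr, loaded in the question, also re-exports it

# map_df() calls the function once per file and row-binds the resulting tibbles
map_df( files, ~ {
  file   <- .x
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text()
  tibble( file, MyText )
} )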
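If you would rather keep the lapply / bind_rows shape from the question, a sketch along the following lines should also work once every list element is a data frame. The helper name read_one and the sample files are hypothetical and exist only for this illustration:

```r
library(rvest)
library(dplyr)  # bind_rows(); also re-exports tibble()

# Hypothetical sample files, created in a temporary folder for illustration only
dir <- tempfile("texts")
dir.create(dir)
writeLines("<DOC><TEXT>first document</TEXT></DOC>",  file.path(dir, "a.txt"))
writeLines("<DOC><TEXT>second document</TEXT></DOC>", file.path(dir, "b.txt"))
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# read_one() is an illustrative helper: one file in, one tibble out
read_one <- function(path) {
  MyText <- read_html(path) %>%
    html_nodes("text") %>%
    html_text()
  tibble(file = basename(path), MyText = MyText)
}

# Each element is now a tibble, so bind_rows() can stack them
tbl <- lapply(files, read_one) %>% bind_rows()
print(tbl)
```

The only change from the question's code is that lapply() is handed a function that returns a data frame rather than a bare pipe.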

Answer 2

Here is my solution. I have tested it on my laptop and it works:

# ________________ THE STEPS BELOW READ THE FILES AND CREATE A DATA FRAME _______________________

# rvest provides read_html(), html_nodes(), html_text() and the %>% pipe
library(rvest)

# set the working directory first
setwd("drive/you/folder/location")

# list the files in that folder
files <- list.files()

# create an empty dataframe
data <- data.frame()

# read files one by one and create dataframe
for (f in files) {

  # read as HTML
  dat <- read_html(f)

  # extract everything between the <TEXT> and </TEXT> tags
  dat2 <- data.frame(Text = dat %>% html_nodes("text") %>% html_text(), stringsAsFactors = FALSE)

  # split the text on " | " into separate fields (only the first <TEXT> node is used)
  dat3 <- data.frame(Text = strsplit(dat2$Text, " \\| ")[[1]], stringsAsFactors = FALSE)

  # split the third field on ":" into a header part and the body text
  dat4 <- data.frame(Text = strsplit(dat3$Text[[3]], ":")[[1]], stringsAsFactors = FALSE)

  # merge all the columns and rows after some basic text cleaning/processing
  NewsData <- data.frame(News_Paper = trimws(dat3$Text[1], which = "both"),
                         News_Class = trimws(dat3$Text[2], which = "both"),
                         Author_Location_Date = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[1], which = "both")),
                         Text = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[2], which = "both"))
  ) 

  # append this file's row to the rows collected from the previous files
  data <- rbind.data.frame(data, NewsData, make.row.names = FALSE, stringsAsFactors = FALSE)

} 

# remove the intermediate data frames
rm(list = c("dat2", "dat3", "dat4"))


# ________________ END OF THE ABOVE STEPS ___________________________________________________
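To make the splitting steps inside the loop easier to follow, here is a minimal sketch on a single made-up string. The "paper | class | author, location, date: body" layout is only an assumption inferred from the code above, and clean() is a hypothetical helper; your real files may differ:

```r
# Hypothetical <TEXT> content; the layout is an assumption, not taken from real data
raw <- "Daily News | Sports | Jane Doe, London, 1 May:\n  The match\tended in a draw."

# Step 1: split on " | " into paper, class and the remainder
fields <- strsplit(raw, " \\| ")[[1]]

# Step 2: split the remainder on ":" into the header and the body
parts <- strsplit(fields[3], ":")[[1]]

# Trim the ends, then collapse any run of whitespace into a single space
clean <- function(x) gsub("\\s+", " ", trimws(x))

row <- data.frame(
  News_Paper           = clean(fields[1]),
  News_Class           = clean(fields[2]),
  Author_Location_Date = clean(parts[1]),
  # paste() rejoins the pieces in case the body itself contains further colons
  Text                 = clean(paste(parts[-1], collapse = ":")),
  stringsAsFactors     = FALSE
)
print(row)
```

Note that the loop above keeps only the piece between the first and second colon, so a body that contains its own colons would be cut short; rejoining the remaining pieces as in this sketch is one way to avoid that.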

I hope this helps.

Read text files (containing HTML tags) and append to a data frame
