Read text files (containing HTML tags) and append to a data frame

Time: 2017-09-23 12:26:12

Tags: r dplyr data.table rvest readr

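I am trying to read all the text files in a folder. For each file, this is what I want to do:

  1. Read the content of a specific HTML tag, "TEXT", from the text file
  2. Store it in a data frame under the column name "MyText"
  3. Append a new row after reading the next text file in the same way

My code is: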

    library(dplyr); library(readr); library(rvest); library(data.table); 
    
    # List all the text files in the folder
    files = list.files(pattern="*.txt")
    
    # read from file and append to rows
    tbl = lapply(files, read_html %>% html_nodes("text") %>%  html_text() ) %>% bind_rows()
    

This gives me an error:

    Error in UseMethod("xml_find_all") : 
      no applicable method for 'xml_find_all' applied to an object of class "function"
    

Can someone help me see where I am going wrong?

2 answers:

Answer 0: (score: 2)

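The core of the problem is that read_html %>% html_nodes("text") %>% html_text() does not evaluate to a function. It pipes the function object read_html itself into html_nodes(), which is why the error mentions an object of class "function". You can use a magrittr lambda instead by starting the pipeline with a dot, e.g. . %>% read_html %>% html_nodes("text") %>% html_text()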
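As a minimal sketch of the dot-first pipeline, the snippet below builds such a function and applies it with lapply. The temporary file and its contents are made up for illustration, and the name extract_text is hypothetical.

```r
library(magrittr)
library(rvest)

# Made-up sample file so the sketch is self-contained (an assumption)
tmp <- tempfile(fileext = ".txt")
writeLines("<DOC><TEXT>hello world</TEXT></DOC>", tmp)

# Starting the pipe with a dot builds a function instead of calling one
extract_text <- . %>% read_html() %>% html_nodes("text") %>% html_text()

# extract_text is now a function, so lapply can call it once per file
texts <- lapply(tmp, extract_text)
```

The HTML parser lowercases tag names, so the selector "text" matches the upper-case TEXT tag in the file.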

Even then, html_text() ultimately gives you a character vector, not a data frame that you can pass to bind_rows.
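One way to stay close to the original lapply approach is to wrap each file's text in a data frame before binding. This is only a sketch: the sample files and the helper name read_one are made up for illustration.

```r
library(rvest)
library(dplyr)

# Made-up sample files in a temporary folder (an assumption)
demo_dir <- tempfile()
dir.create(demo_dir)
writeLines("<DOC><TEXT>first</TEXT></DOC>", file.path(demo_dir, "a.txt"))
writeLines("<DOC><TEXT>second</TEXT></DOC>", file.path(demo_dir, "b.txt"))
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: return a one-column data frame per file,
# so that bind_rows() receives data frames rather than bare vectors
read_one <- function(f) {
  txt <- read_html(f) %>% html_nodes("text") %>% html_text()
  data.frame(MyText = txt, stringsAsFactors = FALSE)
}

tbl <- lapply(files, read_one) %>% bind_rows()
```

Each file contributes one row per TEXT node it contains, so a file with several TEXT tags yields several rows.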

You can use purrr::map_df() instead of lapply / bind_rows:

    library(purrr)
    library(rvest)
    library(tibble)  # provides tibble()

    # Build one tibble per file; map_df() row-binds the results
    map_df(files, ~ {
      file   <- .x
      MyText <- read_html(file) %>%
        html_nodes("text") %>%
        html_text()
      tibble(file, MyText)
    })

Answer 1: (score: 0)

Here is my solution. I have checked it on my laptop and it works:

    # ________________ THE STEPS BELOW READ THE DATA SETS AND CREATE A DATA FRAME ________________

    library(rvest)

    # set the working folder first
    setwd("drive/you/folder/location")

    # list only the text files in the folder
    files <- list.files(pattern = "\\.txt$")

    # create an empty data frame
    data <- data.frame()

    # read the files one by one and build up the data frame
    for (f in files) {

      # read as HTML
      dat <- read_html(f)

      # extract everything between the <TEXT> and </TEXT> tags
      dat2 <- data.frame(Text = dat %>% html_nodes("text") %>% html_text(), stringsAsFactors = FALSE)

      # split the first <TEXT> entry into fields on " | "
      dat3 <- data.frame(Text = strsplit(dat2$Text, " \\| ")[[1]], stringsAsFactors = FALSE)

      # split the third field on ":"
      dat4 <- data.frame(Text = strsplit(dat3$Text[[3]], ":")[[1]], stringsAsFactors = FALSE)

      # combine the fields into one row after some basic text cleaning
      NewsData <- data.frame(News_Paper = trimws(dat3$Text[1], which = "both"),
                             News_Class = trimws(dat3$Text[2], which = "both"),
                             Author_Location_Date = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[1], which = "both")),
                             Text = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[2], which = "both")),
                             stringsAsFactors = FALSE
      )

      # append this file's row to the rows collected so far
      data <- rbind.data.frame(data, NewsData, make.row.names = FALSE, stringsAsFactors = FALSE)

    }

    # remove the intermediate data frames
    rm(list = c("dat2", "dat3", "dat4"))

    # ________________ END OF THE ABOVE STEPS ________________
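The split patterns in the loop suggest that each TEXT block is laid out as "paper | class | author, location, date: body". That layout is an inference and is not stated in the answer. The sketch below walks one made-up string through the same steps; the sample text and the names clean and news_row are hypothetical.

```r
# Made-up sample in the layout the loop appears to assume (an assumption)
raw <- "The Daily | Sports | Jane Doe, London, 23 Sep 2017: Match\nreport   text"

# Split into fields on " | ", then split the third field on ":"
fields <- strsplit(raw, " \\| ")[[1]]
tail_parts <- strsplit(fields[3], ":")[[1]]

# Same cleaning as in the loop: trim, then collapse line breaks and runs of whitespace
clean <- function(x) gsub("\r?\n|\r|\t|\\s+", " ", trimws(x, which = "both"))

news_row <- data.frame(
  News_Paper = trimws(fields[1]),
  News_Class = trimws(fields[2]),
  Author_Location_Date = clean(tail_parts[1]),
  Text = clean(tail_parts[2]),
  stringsAsFactors = FALSE
)
```

Because the third field is split on every colon and only the first two pieces are used, a body that itself contains a colon would be cut off at that point.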

I hope this helps.