I want to scrape some data (30 documents) and wrote the following routine:
library(pacman)
pacman::p_load(ggplot2, gridExtra, reshape2, microbenchmark, pdftools, stringr, tidyverse, tm, tidytext, textdata, htm2txt, XML, rvest, lubridate, purrr, install = TRUE, update = FALSE)
# Download one 2015 account and keep only the main content column
body <- read_html("https://www.ecb.europa.eu/press/accounts/2015/html/mg151119.en.html")
body <- body %>%
  html_nodes("#ecb-content-col") %>%
  html_text() %>%
  readr::read_lines() %>%                                     # split the text on line breaks
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%   # replace carets with a space
  str_replace_all(pattern = "\"", replacement = " ") %>%      # replace double quotes with a space
  str_replace_all(pattern = "\\s+", replacement = " ") %>%    # collapse runs of whitespace
  str_replace_all(pattern = "\t", replacement = " ") %>%
  str_trim(side = "both")
body <- body[body != ""]       # drop empty lines
test1 <- tibble(text = body)
head(test1)
# A tibble: 6 x 1
text
<chr>
1 Account of the monetary policy meeting
2 19 November 2015
3 of the Governing Council of the European Central Bank,held in Malta on Thursday, 22 October 2015
4 1. Review of financial, economic and monetary developments and policy options
5 Financial market developments
6 Mr Cœuré reviewed recent financial market developments.
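Each paragraph gets its own row. When I move on to 2017, the structure of the website seems to have changed (even though I cannot see any difference in the structure of the HTML code):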
# Same cleaning pipeline, applied to a 2017 account
body<-read_html("https://www.ecb.europa.eu/press/accounts/2017/html/ecb.mg170518.en.html")
body<-body %>%
html_nodes("#ecb-content-col") %>%
html_text()%>%
readr::read_lines()%>%
str_replace_all(pattern = "[\\^]", replacement = " ") %>%
str_replace_all(pattern = "\"", replacement = " ") %>%
str_replace_all(pattern = "\\s+", replacement = " ") %>%
str_replace_all(pattern = "\t", replacement = " ") %>%
str_trim(side = "both")
body=body[body!=""]
test2=tibble(text=body)
head(test2)
# A tibble: 6 x 1
text
<chr>
1 Account of the monetary policy meetingof the Governing Council of the European Central Bank, held in Frankfurt am Mai~
2 Links to other language versions (external):
3 Deutsche Bundesbank
4 Banque de France
5 Rotation rights
6 Rotation of voting rights of ECB Governing Council members
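Here the entire body text ends up in row 1. The individual paragraphs are glued together without any whitespace. Is it possible to add artificial line separators (\n) to recover the original structure (as in test1)?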
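
One possible direction, sketched under assumptions: if the 2017 pages contain no newline characters between block elements in their HTML source, `html_text()` on the whole container has nothing to split on. Instead of inserting artificial `\n` separators afterwards, each block-level node can be selected on its own, so every heading or paragraph becomes one element. The inline HTML below is a made-up stand-in for such a page (not the real ECB markup), and the list of tags in the selector (`h1`, `h2`, `h3`, `p`) is a guess that may need adjusting for the actual pages.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical stand-in for a page whose source has no newlines between blocks
html <- paste0(
  '<html><body><div id="ecb-content-col">',
  '<h1>Account of the monetary policy meeting</h1>',
  '<p>of the Governing Council of the European Central Bank</p>',
  '<h2>1. Review of financial, economic and monetary developments</h2>',
  '<p>Financial market developments were reviewed.</p>',
  '</div></body></html>'
)

page <- read_html(html)

# Calling html_text() on the container returns one glued string
glued <- page %>%
  html_nodes("#ecb-content-col") %>%
  html_text()

# Selecting each block-level node separately gives one string per block
# (the tag list is an assumption about which elements carry text)
blocks <- page %>%
  html_nodes("#ecb-content-col h1, #ecb-content-col h2, #ecb-content-col h3, #ecb-content-col p") %>%
  html_text() %>%
  str_squish()                 # trim and collapse internal whitespace

blocks <- blocks[blocks != ""] # drop empty blocks
test2_fixed <- tibble(text = blocks)
```

On a real page the same selector would replace `html_nodes("#ecb-content-col")` in the original pipeline, and the `readr::read_lines()` step would no longer be needed. Matching tags that are nested inside each other would yield duplicated text, so the selector should name only leaf-like blocks. Recent rvest versions (1.0.0 and later) also provide `html_text2()`, which tries to mimic browser line breaking and may be a simpler alternative.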