Inserting line breaks when scraping data

Time: 2019-07-02 08:26:36

Tags: html r web-scraping rvest

I want to scrape some data (30 documents) and wrote the following routine:

library(pacman)
# Load the required packages, installing any that are missing (duplicate entries removed).
pacman::p_load(ggplot2, gridExtra, reshape2, microbenchmark, pdftools, stringr, tidyverse, tm, tidytext, textdata, htm2txt, XML, rvest, lubridate, purrr, install = TRUE, update = FALSE)

body <- read_html("https://www.ecb.europa.eu/press/accounts/2015/html/mg151119.en.html")
body <- body %>%
  html_nodes("#ecb-content-col") %>%   # main content column
  html_text() %>%
  readr::read_lines() %>%              # split the text at existing line breaks
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"", replacement = " ") %>%
  str_replace_all(pattern = "\\s+", replacement = " ") %>%
  str_replace_all(pattern = "\t", replacement = " ") %>%
  str_trim(side = "both")
body <- body[body != ""]               # drop empty lines
test1 <- tibble(text = body)

head(test1)
# A tibble: 6 x 1
text                                                                                            
<chr>                                                                                           
1 Account of the monetary policy meeting                                                          
2 19 November 2015                                                                                
3 of the Governing Council of the European Central Bank,held in Malta on Thursday, 22 October 2015
4 1. Review of financial, economic and monetary developments and policy options                   
5 Financial market developments                                                                   
6 Mr Cœuré reviewed recent financial market developments.

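Each paragraph gets its own row. When I move on to 2017, the structure of the website seems to have changed (even though I cannot see any difference in the structure of the HTML code):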

body <- read_html("https://www.ecb.europa.eu/press/accounts/2017/html/ecb.mg170518.en.html")
body <- body %>%
  html_nodes("#ecb-content-col") %>%   # same selector as for the 2015 page
  html_text() %>%
  readr::read_lines() %>%              # split the text at existing line breaks
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"", replacement = " ") %>%
  str_replace_all(pattern = "\\s+", replacement = " ") %>%
  str_replace_all(pattern = "\t", replacement = " ") %>%
  str_trim(side = "both")
body <- body[body != ""]               # drop empty lines
test2 <- tibble(text = body)
head(test2)
# A tibble: 6 x 1
text                                                                                                                  
<chr>                                                                                                                 
1 Account of the monetary policy meetingof the Governing Council of the European Central Bank, held in Frankfurt am Mai~
2 Links to other language versions (external):                                                                          
3 Deutsche Bundesbank                                                                                                   
4 Banque de France                                                                                                      
5 Rotation rights                                                                                                       
6 Rotation of voting rights of ECB Governing Council members  

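Here the entire body text ends up in row 1. The individual paragraphs are glued together without any whitespace between them. Is it possible to insert artificial line separators (\n) so that I get the original structure back (as in test1)?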
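
One direction that might work, sketched here under assumptions rather than tested against the ECB pages: instead of calling html_text() on the whole container and relying on line breaks in the page source, extract the text of each block-level node separately, so that every heading or paragraph becomes its own element. The inline HTML string below is a made-up stand-in for the 2017 layout, and the list of tags (h1, h2, h3, p, li) is an assumption that would need to be checked against the real markup.

```r
library(rvest)  # read_html(), html_node(), html_nodes(), html_text()

# Hypothetical miniature of the 2017 page: no whitespace between block tags.
html <- paste0(
  '<html><body><div id="ecb-content-col">',
  '<h1>Account of the monetary policy meeting</h1>',
  '<p>of the Governing Council of the European Central Bank</p>',
  '<h2>1. Review of financial, economic and monetary developments</h2>',
  '<p>Financial market   developments were reviewed.</p>',
  '</div></body></html>'
)
page <- read_html(html)
content <- html_node(page, "#ecb-content-col")

# html_text() on the container concatenates all text nodes with no separator,
# which reproduces the "meetingof" effect seen in test2.
glued <- html_text(content)

# Assumed set of block-level tags; a single XPath step returns the matches
# in document order.
blocks <- html_nodes(
  content,
  xpath = ".//*[self::h1 or self::h2 or self::h3 or self::p or self::li]"
)

# One element per block, with internal whitespace collapsed.
paragraphs <- html_text(blocks, trim = TRUE)
paragraphs <- gsub("\\s+", " ", paragraphs)
paragraphs <- paragraphs[paragraphs != ""]

# If a single string with artificial separators is wanted instead:
with_breaks <- paste(paragraphs, collapse = "\n")
```

With the real page, tibble(text = paragraphs) would then give a table shaped like test1. If a matched node is nested inside another matched node (for example a p inside an li), its text would appear twice, so the tag list may need narrowing.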

0 answers:

No answers