I have 3 different text files, named txt1, txt2, txt3:
txt1 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")
txt2 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")
txt3 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")
Now I am trying to create a single HTML text file so I can analyze all of them as if they were one file. Does anyone know how to combine several HTML text files into one?
Answer 0 (score: 1)
You are on the right track. How about this?
library(rvest)
library(tidyverse)
txt1 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")
txt2 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")
txt3 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1 " )
# first link
df1 <- txt1 %>%
  html_nodes('#mw-content-text p') %>% # select the paragraph text
  html_text() %>%
  t() %>%                              # transpose
  data.frame() %>%                     # as data.frame
  unite("text", sep = " ")             # merge all cells into one text column
The same goes for the second and third links:
df2 <- txt2 %>%
  html_nodes('#mw-content-text p') %>%
  html_text() %>% t() %>% data.frame() %>% unite("text", sep = " ")
df3 <- txt3 %>%
  html_nodes('#mw-content-text p') %>%
  html_text() %>% t() %>% data.frame() %>% unite("text", sep = " ")
Finally, put everything into a single cell, for example:
df_total <- cbind(df1, df2, df3) %>% unite("text", sep = " ")
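If you only need the raw text, the `t()` / `data.frame()` / `unite()` round trip can be skipped entirely. A minimal sketch using `paste(collapse = ...)` instead (the helper name `grab_text` and the `*_url` variables are my own; the selector is the one from the answer above):

```r
library(rvest)
library(magrittr)

# Sketch: collapse each page's paragraphs into one string with paste(),
# then join the per-page strings into one big string
grab_text <- function(url) {
  read_html(url) %>%
    html_nodes('#mw-content-text p') %>%
    html_text() %>%
    paste(collapse = " ")
}

# txt1_url, txt2_url, txt3_url would hold the three URLs as strings
all_text <- paste(grab_text(txt1_url),
                  grab_text(txt2_url),
                  grab_text(txt3_url),
                  sep = " ")
```

This returns a plain character vector of length one, which is often easier to feed into text-analysis functions than a one-cell data frame.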
Edit:
You can create a loop that parses all the pages from a vector of links:
txt1 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")
txt2 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")
txt3 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")
url <- c(txt1, txt2, txt3) # all the urls
# the loop that scrapes each page and stores the results in a list
dfList <- lapply(url, function(i) {
  swimwith <- read_html(i)
  swdf <- swimwith %>%
    html_nodes('#mw-content-text p') %>%
    html_text() %>%
    t() %>%
    data.frame() %>%
    unite("text", sep = " ")
})
# combine the list into one data frame with a single text cell
finaldf1 <- do.call(cbind, dfList) %>% unite("text", sep = " ")
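Since the stated goal is a single text file on disk, the merged string can then be written out with `writeLines()`. A short sketch (the file name `all_speeches.txt` is an arbitrary choice, not from the original answer):

```r
# finaldf1 holds one row with one united text column;
# write that single string out as a plain-text file
writeLines(as.character(finaldf1[1, 1]), "all_speeches.txt")
```

The resulting file can then be re-read with `readLines()` or passed directly to a text-mining package as one document.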