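How to retain formatting when scraping a web page with rvest

Time: 2016-07-05 04:20:22

Tags: r web-scraping shiny rvest

There is a sample web page I want to pull a song lyric from, and I would like to replicate its layout in a Shiny app, probably inside a renderUI() function.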

People all over the world (everybody) 
Join hands (join)
Start a love train, love train
People all over the world (all the world, now)
Join hands (love ride)
Start a love train (love ride), love train

The next stop that we make will be soon (etc)

Using rvest I can get the nodeset and the plain text, but it is not clear what the best way is to display the text in its original format.

library(rvest)
url <- "https://play.google.com/music/preview/Ttyni4p5vi3ohx52e7ye7m37hlm?lyrics=1&utm_source=google&utm_medium=search&utm_campaign=lyrics&pcampaignid=kp-lyrics&sa=X&ved=0ahUKEwiV7oXtqtvNAhVB5GMKHTnHDZEQr6QBCBsoADAB"

read_html(url) %>%
   html_nodes("p")

{xml_nodeset (6)}
[1] <p>People all over the world (everybody)<br/>Join hands (join)<br/>Start a love train, love train<br/>People all over the world (a ...
[2] <p>The next stop that we make will be soon<br/>Tell all the folks in Russia, and China, too<br/>Don't you know that it's time to g ...

read_html(url) %>%
   html_nodes("p") %>% 
   html_text()

[1] "People all over the world (everybody)Join hands (join)Start a love train, love trainPeople all over the world (all the world, now)Join hands (love ride)Start a love train (love ride), love train"                                                                                                                                                                                                            
[2] "The next stop that we make will be soonTell all the folks in Russia, and China, tooDon't you know that it's time to get on boardAnd let this train keep on riding, riding on throughWell, well"

TIA

1 answer:

Answer 0 (score: 2)

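You can use xml2::xml_contents, which splits a node into all of its children (text and tags alike). Since rvest uses xml2 for things like read_html, the function should already be available without explicitly calling library(xml2) (though you can still do so if you like); the namespaced form xml2::xml_contents works either way.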

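If you add library(purrr), you can map over the children of each <p> tag, which lets you separate out the verses. If you would rather not pull in another package, purrr::map is roughly the same as lapply in this example, except for the last call, so I've added base versions in comments.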
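A minimal sketch of this approach, written with base lapply rather than purrr::map. It assumes an inline HTML snippet as a hypothetical stand-in for the lyrics page (so it runs offline) and loads xml2 directly; the names page, paragraphs, verses and verse_html are illustrative only.

```r
library(xml2)  # rvest builds on xml2; loaded directly so the sketch is self-contained

# Hypothetical stand-in for the lyrics page, so the example runs offline.
page <- read_html(
  "<html><body>
   <p>People all over the world (everybody)<br/>Join hands (join)<br/>Start a love train, love train</p>
   <p>The next stop that we make will be soon<br/>Tell all the folks in Russia, and China, too</p>
   </body></html>"
)

# Same nodes that rvest::html_nodes(page, "p") would select.
paragraphs <- xml_find_all(page, "//p")

# For each <p>, list its children (text nodes and <br/> tags), take their text,
# and drop the empty strings that the <br/> tags produce.
verses <- lapply(paragraphs, function(p) {
  lines <- trimws(xml_text(xml_contents(p)))
  lines[nzchar(lines)]
})

# Rebuild each verse as an HTML fragment with explicit line breaks.
verse_html <- vapply(
  verses,
  function(lines) paste0("<p>", paste(lines, collapse = "<br/>"), "</p>"),
  character(1)
)
```

Each element of verses is a character vector with one lyric line per entry, so the verse and line structure survives. In a Shiny app, the strings in verse_html could presumably be wrapped in HTML() inside renderUI(); text containing characters such as & would need escaping first.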
