Cleaning HTML code in R: how can I clean up this list?

Asked: 2017-08-08 09:32:51

Tags: r regex gsub

I know this question has been asked many times, but after reading a lot of threads I'm still stuck :(. I have a list of HTML nodes like this:

<a href="http://bit.d o/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://bit.d o/bnRinN9</a>

I just want to strip out all the code parts. Unfortunately, I'm a newbie, and the only thing I can think of is the Cthulhu way (regex, alas!). How can I do this?

*I put a space between the "d" and the "o" in the domain name because SO doesn't allow posting that link.

2 answers:

Answer 0 (score: 1)

This uses the data linked in Why R can't scrape these links?, which has already been downloaded.

library(rvest)
library(stringr)

# read the saved html page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")

# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements.
# There are probably smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")

# extract all the short links, but drop the links to the edit pages
# note these links have a trailing dash -- they point to the statistics,
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]

# the real urls are in the html text, prefixed with http
span_text  <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]

# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"

Answer 1 (score: 0)

The rvest library contains many simple functions for scraping and processing html; it depends on the xml2 package. Often you can do the scraping and the filtering in a single step.

It isn't clear whether you want to extract the href values or the html text, which are identical in your example. This code extracts the href values by finding the a nodes and then reading the href attribute. Alternatively, you can use html_text to get the displayed link text (a short sketch follows the output below).

library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')

# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs  


## [1] "http://anydomain.com/bnRinN9" "domain.com/page"