Cleaning HTML code in R: how can I clean up this list?

Asked: 2017-08-08 09:32:51

Tags: r regex gsub

I know this question has been asked many times, but after reading a lot of threads I'm still stuck :(. I have a list of HTML nodes like this:

<a href="http://bit.d o/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://bit.d o/bnRinN9</a>

I just want to strip out all the code parts. Unfortunately, I'm a newbie, and the only thing I can think of is the Cthulhu way (regex, alas!). How can I do this?

*I put a space between the "d" and the "o" in the domain name because SO doesn't allow posting that link.

2 answers:

Answer 0 (score: 1)

This uses the data linked in Why R can't scrape these links?, which has already been downloaded.

library(rvest)
library(stringr)

# read the saved html page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")

# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements.
# There are probably smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")

# extract all the short links, but drop the links to the edit pages
# note these links have a trailing dash -- they point to the statistics,
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]

# the real urls are in the html text, prefixed with http
span_text  <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]

# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"

Answer 1 (score: 0)

The rvest library contains many simple functions for scraping and processing html; it depends on the xml2 package. Often you can do the scraping and the filtering in a single step.

It isn't clear whether you want to extract the href values or the html text, which are identical in your example. This code extracts the href values by finding the a nodes and then reading the href attribute. Alternatively, you can use html_text to get the displayed link text (a short sketch follows the output below).

library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')

# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs  


## [1] "http://anydomain.com/bnRinN9" "domain.com/page"