Question

I figure there must be a simple answer to this, but I can't seem to find it.

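I'm scraping various web pages and I want to extract all the links from a page. I'm using htmlParse to do this and I'm about 95% of the way there, but I need some help.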

Here is my code to scrape the web page

library(XML)

MyURL <- "http://stackoverflow.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node

Once I have the root node, I can run this to get the link nodes

URL_Links <- xpathSApply(URLroot, "//a") # Get all <a> nodes under the root

which gives me output like this

[[724]]
<a href="//area51.stackexchange.com" title="proposing new sites in the Stack Exchange network">Area 51</a> 

[[725]]
<a href="//careers.stackoverflow.com">Stack Overflow Careers</a> 

[[726]]
<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>

Or, I can run this

URL_Links_values = xpathSApply(URLroot, "//a", xmlGetAttr, "href") # Get all href values

to get just the HREF values, like this

[[721]]
[1] "http://creativecommons.org/licenses/by-sa/3.0/"

[[722]]
[1] "http://blog.stackoverflow.com/2009/06/attribution-required/"

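However, what I'm looking for is a way to easily get both the HREF value and the link name, preferably loaded into a data frame or matrix, so that instead of getting back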

<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a> 
<a href="http://blog.stackoverflow.com/2009/06/attribution-required/" rel="license">attribution required</a>

I get this

                  Name                                                        HREF
1         cc by-sa 3.0              http://creativecommons.org/licenses/by-sa/3.0/
2 attribution required http://blog.stackoverflow.com/2009/06/attribution-required/

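Now, I could take the output of URL_Links and use some regex or string splitting to get this data, but it seems like there should be an easier way to do this with the XML package.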

Is there an easy way to do what I want?

EDIT:

I just figured out that I can do this to get the link names

URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get all link names (the text of each <a>)

However, when I run this

df <- data.frame(URL_Links_names, URL_Links_values)

I get this error

Error in data.frame("//stackoverflow.com", "http://chat.stackoverflow.com",  : arguments imply differing number of rows: 1, 0

I'm guessing there are links without names, so how can I substitute "" or NA for any unnamed links?

Answer 1

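There appear to be a few <a> elements in the html with no href attribute. Since xmlGetAttr() returns NULL when the requested attribute is not present, you can find them with is.null(). You can then put that into an if() condition that yields an empty string for the missing ones and the href attribute otherwise. There's no need to subset the root node.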

library(XML)
## parse the html document
doc <- htmlParse("http://stackoverflow.com/")
## use the `[` accessor on the parsed document to select every 'a' node, then apply our function
getvals <- lapply(doc["//a"], function(x) {
    data.frame(
        ## get the xml value
        Name = xmlValue(x, trim = TRUE), 
        ## get the href link if it exists
        HREF = if(is.null(att <- xmlGetAttr(x, "href"))) "" else att,
        stringsAsFactors = FALSE
    )
})
## create the full data frame
df <- do.call(rbind, getvals)
## have a look
str(df)
# 'data.frame': 697 obs. of  2 variables:
#  $ Name: chr  "current community" "chat" "Stack Overflow" "Meta Stack Overflow" ...
#  $ HREF: chr  "//stackoverflow.com" "http://chat.stackoverflow.com" "//stackoverflow.com" "http://meta.stackoverflow.com" ...

tail(df)
#                       Name                                                        HREF
# 692             Stack Apps                                             //stackapps.com
# 693    Meta Stack Exchange                                    //meta.stackexchange.com
# 694                Area 51                                  //area51.stackexchange.com
# 695 Stack Overflow Careers                                 //careers.stackoverflow.com
# 696           cc by-sa 3.0              http://creativecommons.org/licenses/by-sa/3.0/
# 697   attribution required http://blog.stackoverflow.com/2009/06/attribution-required/
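
As a possible variation on the above (a sketch, not something from the original answer): xmlGetAttr() also takes a default argument that is returned when the attribute is absent, so both columns can be built as plain vectors without a per-node data.frame. The inline HTML and the example.com URLs below are made up so the snippet runs without network access.

```r
library(XML)

# Made-up inline page for illustration; the second <a> has no href
html <- '<html><body>
  <a href="http://example.com/one">First</a>
  <a name="anchor-only">No href here</a>
  <a href="//example.com/two"> Second </a>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# `default` is returned for nodes lacking the attribute, so every node
# contributes exactly one value and the result simplifies to a vector
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href", default = NA_character_)
link_names <- xpathSApply(doc, "//a", xmlValue, trim = TRUE)

links <- data.frame(Name = link_names, HREF = hrefs, stringsAsFactors = FALSE)
print(links)
```

Because missing hrefs become NA rather than NULL, both vectors have the same length and data.frame() no longer complains about differing numbers of rows.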

Answer 2

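My goal was to look through all the link names and then decide which URLs I needed. I didn't find a way to get everything I wanted into a data frame, but what I was able to do is get all the link names like this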

library(XML)

MyURL <- "http://stackoverflow.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node
URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get all link names (the text of each <a>)

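This gets me all the link names. Search through the names and decide whether you need some or all of them; you can then pass a link name to this function to get the HREF value of each link with that name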

GetLinkURLByName <- function(LinkName, WebPageURL) {
  LinkURL <- getHTMLLinks(WebPageURL, xpQuery = sprintf("//a[text()='%s']/@href",LinkName))
  return(LinkURL)
}

LinkName = a link name from URL_Links_names. WebPageURL = the web page you are scraping (in this example, I pass it MyURL)
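
A possible variation, assuming you would rather parse the page once and reuse it for each lookup instead of passing the URL every time: the helper below (GetLinkURLFromDoc is a hypothetical name, not part of the XML package) runs the lookup against an already-parsed document. The inline HTML and the example.com URLs are made up so the sketch runs offline.

```r
library(XML)

# Made-up inline page standing in for the real one
html <- '<html><body>
  <a href="http://example.com/one"> First </a>
  <a href="http://example.com/two">Second</a>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: look up hrefs by link text on a parsed document.
# normalize-space(.) ignores leading/trailing whitespace in the link text.
# Caveat: a LinkName containing a single quote would break this XPath string.
GetLinkURLFromDoc <- function(LinkName, Doc) {
  query <- sprintf("//a[normalize-space(.)='%s']", LinkName)
  hrefs <- xpathSApply(Doc, query, xmlGetAttr, "href", default = NA_character_)
  as.character(hrefs)
}

first_url <- GetLinkURLFromDoc("First", doc)
print(first_url)
```

Matching on normalize-space(.) rather than text() is a design choice here: it still finds a link whose text is padded with whitespace, which an exact text() comparison would miss.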

R: scraping links from a web page
