Programmatically assigning variable values during HTML parsing

Asked: 2015-04-09 19:41:37

Tags: xml r

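I am extending my previous question about HTML parsing to cover empty values. Suppose some of the variables I extract from the HTML come back empty. Several variables can be empty, so I want a systematic way to handle them (a loop or a function).

This question is really about assigning variables programmatically. Most of the information I have found recommends avoiding eval(parse(text = ...)), but I am not sure how to replace it in this case. I have the following HTML: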

html <- 
'<!DOCTYPE html>
<html>
    <body>
        <div class="foo">
            <div class="fooname">Name of 1st foo</div>
            <div class="abc">ABC value only present here</div>
            <span>1st span in 1st foo</span>
            <span>2nd span in 1st foo</span>
        </div>

        <div class="foo">
            <div class="fooname">Name of 2nd foo</div>
            <span>Only 1 span in 2nd foo</span>
        </div>
    </body>
</html>'

Here is the parsing:

library(XML)

html.parse <- htmlParse(html)

myFunc <- function(x){
    fooname <- xpathSApply(x, "./div[@class='fooname']", fun = xmlValue)
    abc <- xpathSApply(x, "./div[@class='abc']", fun = xmlValue)
    span <- xpathSApply(x, "./span", fun = xmlValue)

    df <- data.frame(fooname, abc, Span1 = span[1], Span2 = span[2])
    return(df)
}

result <- getNodeSet(html.parse, "//div[@class='foo']", fun = myFunc)

#  Error in data.frame(fooname, abc, Span1 = span[1], Span2 = span[2]) : 
#   arguments imply differing number of rows: 1, 0 

Here is my attempted fix:

myFunc <- function(x){
    fooname <- xpathSApply(x, "./div[@class='fooname']", fun = xmlValue)
    abc <- xpathSApply(x, "./div[@class='abc']", fun = xmlValue)
    span <- xpathSApply(x, "./span", fun = xmlValue)


    dfvars <- c("fooname", "abc", "span")

    #I think I have the same issue about assigning a variable in `apply`
        #functions, right?

    for(var in dfvars) {

        if(length(eval(parse(text = var))) == 0) {
            cat("No ", var, " value found for this group.\n")

            #Note the "list" class:
            cat("Class of ", var, " is: ", class(eval(parse(text = var))), "\n")
            cat("Placing an NA.\n")

            #This line gives an error:
            assign(eval(parse(text = var)), as.character(NA))

            cat("new value of ", var, " : ", eval(parse(text = var)), "\n")
            cat("New length of ", var, " : ", length(eval(parse(text = var))), "\n")
            cat("New class of ", var, " : ", class(eval(parse(text = var))), "\n")

        }
    }

    df <- data.frame(fooname, abc, Span1 = span[1], Span2 = span[2])
    return(df)
}

result <- getNodeSet(html.parse, "//div[@class='foo']", fun = myFunc)

#  Error in assign(eval(parse(text = var)), as.character(NA)) : 
#   invalid first argument 
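
The error message points at the first argument of assign(): it has to be the variable's name as a character string, whereas eval(parse(text = var)) returns the variable's current value (here an empty list). A minimal sketch of the name-based form, using get() to read and assign() to write; the two values below are illustrative stand-ins, not output from the parser:

```r
# Illustrative stand-ins for the results of xpathSApply():
# a match yields a character vector, no match yields an empty list().
fooname <- "Name of 2nd foo"
abc <- list()

for (var in c("fooname", "abc")) {
  # get(var) looks the variable up by name, replacing eval(parse(text = var))
  if (length(get(var)) == 0) {
    # the first argument is the name (a string), not the value
    assign(var, NA_character_)
  }
}
```

After the loop, abc holds a character NA and fooname is unchanged.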

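Note that the for loop here (or the apply function, if I go that way) sits at the second level of nesting. In my real project it sits at the third level; the outer level iterates over a series of pages. I would like to avoid a third level if possible, but I also want to keep things simple.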

1 Answer:

Answer 0: (score: 1)

You can define your own wrapper around xpathSApply that tests for an empty list():

myXpathSApply <- function(x, ...){
  y <- xpathSApply(x, ...)
  if(length(y) > 0){y}else{NA}
}

and use this function wherever you would otherwise call xpathSApply:
myFunc <- function(x){
    fooname <- myXpathSApply(x, "./div[@class='fooname']", fun = xmlValue)
    abc <- myXpathSApply(x, "./div[@class='abc']", fun = xmlValue)
    span <- myXpathSApply(x, "./span", fun = xmlValue)

    df <- data.frame(fooname, abc, Span1 = span[1], Span2 = span[2])
    return(df)
}
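
As a variation on the same idea, the extracted values can be kept in a named list so that the empty check is applied to all of them in one lapply() call, with no eval(parse()) or assign(). This is only a sketch: na_if_empty and vals are hypothetical names, and the list contents are illustrative stand-ins for what xpathSApply() returns for the second "foo" div.

```r
# Illustrative stand-ins for the xpathSApply() results of the 2nd "foo" div:
# "abc" has no match, so it is an empty list().
vals <- list(
  fooname = "Name of 2nd foo",
  abc     = list(),
  span    = "Only 1 span in 2nd foo"
)

# Hypothetical helper: turn any zero-length result into a character NA.
na_if_empty <- function(v) if (length(v) == 0) NA_character_ else v
vals <- lapply(vals, na_if_empty)

# Indexing past the end of a vector yields NA, so Span2 needs no special case.
df <- data.frame(
  fooname = vals$fooname,
  abc     = vals$abc,
  Span1   = vals$span[1],
  Span2   = vals$span[2],
  stringsAsFactors = FALSE
)
print(df)
```

Using NA_character_ rather than a plain NA keeps the column type character even when the value is missing.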