Function throws an error when applied to a vector but not to a single element

Time: 2017-01-25 23:49:45

Tags: r

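I have rewritten the following to clarify the problem and its solution, leaving the function and the fix at the bottom as examples. Thanks again to John Coleman for the help!

Problem: I wrote a data-scraping function that works when passed a single URL but not when passed a vector of URLs, in which case it throws this error: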

Error in data.frame(address, recipename, prept, cookt, calories, protein, : arguments imply differing number of rows: 1, 14, 0

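It turned out that some of the URLs I was trying to scrape use different markup for their instructions section. For those pages the xpathSApply call that scrapes the instructions returns a list of length 0, and that zero-length column is what triggers the error above when the row is assembled with data.frame(), before it ever reaches rbind.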
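The failure can be reproduced without any scraping. The following is a minimal sketch that assumes stand-in values with the same lengths as in the error message (1, 14 and 0); the URL and ingredient strings are placeholders, and character(0) stands in for the empty result that xpathSApply returns when nothing matches.

```r
# Stand-in values with the same lengths as in the failing case.
address      <- "http://example.com/recipe"   # placeholder URL, length 1
ingredients  <- paste("ingredient", 1:14)      # 14 matches
instructions <- character(0)                   # no XPath match, length 0

# data.frame() can recycle a length-1 column to 14 rows,
# but it cannot recycle a length-0 column, so it stops.
msg <- tryCatch(
  {
    data.frame(address, ingredients, instructions)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The printed message should resemble the one in the question, since the column lengths are the same.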

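Finding the problem was just a matter of running each URL individually until one produced the error, then inspecting that page's HTML structure.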
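That search can be automated. The sketch below is one way to do it, not what I originally ran: find_failures is a hypothetical helper, and fake_scrape is a made-up stand-in for the real scraping function so that the example runs offline.

```r
# Hypothetical stand-in for the scraper: fails on inputs containing "bad".
fake_scrape <- function(u) {
  if (grepl("bad", u)) stop("arguments imply differing number of rows")
  data.frame(address = u, stringsAsFactors = FALSE)
}

# Call the scraper on each URL separately and keep the ones that raise an error.
find_failures <- function(urls, scrape) {
  failed <- vapply(urls, function(u) {
    tryCatch(
      {
        scrape(u)
        FALSE
      },
      error = function(e) TRUE
    )
  }, logical(1), USE.NAMES = FALSE)
  urls[failed]
}

# Placeholder strings, not real URLs.
failing <- find_failures(c("page-1", "bad-page", "page-3"), fake_scrape)
print(failing)
```

With the real function, the call would be find_failures(urls, f4fscrape), and the returned URLs are the pages whose HTML is worth inspecting.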

Here is the function I originally wrote:

f4fscrape <- function(url) {

#Create an empty dataframe

    df <- data.frame(matrix(ncol = 11, nrow = 0))
    colnames <- c('address', 'recipename', 'prept', 'cookt',
                  'calories', 'protein', 'carbs', 'fat',
                  'servings', 'ingredients', 'instructions')
    colnames(df) <- paste(colnames)

    #check for the recipe url in dataframe already,
    #only carry on if not present

    for (i in length(url)) 
            if (url[i] %in% df$url) { next }
    else {

    #parse url as html

    doc2 <-htmlTreeParse(url[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- url[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, list(ingredients), instructions)

    #rename the tricky column

    colnames(result)[10] <- 'ingredients'

    #bind data to existing df

    df <- rbind(df, result)
            }

    #return df

    df
}

Here is the solution - I simply added a condition as follows:

instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
            if (length(instructions) == 0) {
                    instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)}
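The same idea can be generalised to any number of markup variants. This is only a sketch: first_match is a hypothetical helper, the inline HTML is made up to mimic a page that uses the second markup variant, and returning NA when nothing matches is my own assumption about what the scraper should do in that case.

```r
library(XML)

# Try each XPath in turn and return the first non-empty result as one string.
# Falls back to NA so the field always has length 1.
first_match <- function(doc, xpaths) {
  for (xp in xpaths) {
    hit <- xpathSApply(doc, xp, xmlValue)
    if (length(hit) > 0) return(paste(hit, collapse = " "))
  }
  NA_character_
}

# Made-up page that only uses the <ul> variant of the instructions markup.
page <- htmlParse(
  "<html><body><ul class='b-list m-circle instrs'><li>Mix.</li><li>Bake.</li></ul></body></html>",
  asText = TRUE
)

instructions <- first_match(page, c("//ol[@class='methodOL']",
                                    "//ul[@class='b-list m-circle instrs']"))
missing <- first_match(page, "//span[@class='no-such-class']")
```

Because every field then has length 1, data.frame() no longer sees a zero-length column even on pages where none of the known variants match.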

1 answer:

Answer 0: (score: 0)

I was able to tweak your function so that it works:

library(XML)  # provides htmlTreeParse, xmlRoot, xpathSApply and xmlValue

f4fscrape <- function(urls) {

  #Create an empty dataframe

  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')

  names(df) <- cnames

  #check for the recipe url in dataframe already,
  #only carry on if not present

  for (i in 1:length(urls)) 
    if (urls[i] %in% df$address) {
      next }
  else {
    #parse url as html

    doc2 <-htmlTreeParse(urls[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- urls[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, paste0(ingredients, collapse = ", "), instructions, stringsAsFactors = FALSE)


    df <- rbind(df, setNames(result, names(df)))
  }

  #return df

  df
}
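The setNames() call in the rbind line above matters because rbind matches data frame columns by name. Here is a small sketch of that behaviour, assuming a shortened two-column data frame with placeholder values rather than the real scraped data:

```r
# Existing data frame with the intended column names (placeholder values).
df <- data.frame(address = "u0", ingredients = "salt",
                 stringsAsFactors = FALSE)

# Built from unnamed arguments, so data.frame() generates column names
# from the expressions; they do not match names(df).
result <- data.frame("u1", paste0(c("flour", "eggs"), collapse = ", "),
                     stringsAsFactors = FALSE)

# Binding the mismatched frame raises an error, captured here as a condition.
mismatch <- tryCatch(rbind(df, result), error = function(e) e)

# Renaming the new row to match the existing data frame lets rbind succeed.
combined <- rbind(df, setNames(result, names(df)))
```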

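Changes:

1) url is the name of a built-in function, so I renamed the argument to urls, and likewise renamed colnames to cnames.

2) I changed the way the column names are assigned.

3) The loop for (i in length(url)) skips straight to the last index. I changed it to for (i in 1:length(urls)).

4) The condition if (url[i] %in% df$url) refers to a column (url) that does not exist. I changed it to address.

5) The most important change: I used paste0 to join the ingredients into a single string. With what you had, in the one-URL case each ingredient was put on its own row, and the other columns were simply repeated (by the recycling rule). Run your current code with a single URL and View() the result - it probably isn't what you want, so it only "works" when passed one URL in a loose sense.

6) With all these long strings, setting stringsAsFactors = FALSE seemed like a good idea.

7) When you rbind the new row, its names need to be set to match those of the data frame. See this question.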
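Points 3 and 5 are easy to check in isolation. The sketch below uses placeholder strings instead of real URLs and ingredients; seq_along() is shown as an alternative to 1:length(urls) that also behaves sensibly for an empty vector, and it is my substitution rather than what the adjusted function uses.

```r
urls <- c("a", "b", "c")   # placeholder strings, not real URLs

# Point 3: length(urls) is the single number 3, so the loop body runs once.
visited_wrong <- integer(0)
for (i in length(urls)) visited_wrong <- c(visited_wrong, i)

# seq_along(urls) is 1, 2, 3, so every element is visited.
visited_right <- integer(0)
for (i in seq_along(urls)) visited_right <- c(visited_right, i)

# Point 5: a length-3 column forces three rows, and the length-1
# address column is recycled to fill them.
ingredients <- c("flour", "sugar", "eggs")
long <- data.frame(address = "u1", ingredients, stringsAsFactors = FALSE)

# Collapsing the ingredients first gives one row per URL.
wide <- data.frame(address = "u1",
                   ingredients = paste0(ingredients, collapse = ", "),
                   stringsAsFactors = FALSE)

print(visited_wrong)
print(visited_right)
print(nrow(long))
print(nrow(wide))
```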

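When you View the result of running the adjusted function on the given list of URLs, you see something like the following (though of course not shrunk down):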

[Screenshot of the View() output]

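I don't know the XML library well enough to help you improve the speed. Sometimes it runs slowly and sometimes quickly, so it probably has mostly to do with connection speed and is largely outside your control.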