Question

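I have rewritten the following to clarify the problem and its solution, and left the function and the fix at the bottom as an example. Thanks again to John Coleman for the help!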

Problem: I wrote a data-scraping function that works when passed a single URL but not when passed a vector of URLs, where it throws this error:

Error in data.frame(address, recipename, prept, cookt, calories, protein, : arguments imply differing number of rows: 1, 14, 0

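It turned out that some of the URLs I was trying to scrape use different markup for their instructions section. For those pages the xpathSApply call that scrapes the instructions returned a list of length 0, which produced the error when the row was assembled with data.frame, before it ever reached rbind.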
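To illustrate why a zero-length result breaks the row, here is a minimal base-R sketch. The URL is a placeholder, character(0) stands in for what a non-matching XPath yields, and or_na is a hypothetical helper that is not part of the original function.

```r
# Placeholder values: one URL, and an empty result standing in for
# what xpathSApply returns when the XPath matches nothing.
address      <- "http://example.com/recipe"
instructions <- character(0)

# data.frame() cannot recycle a zero-length column to one row, so it stops.
msg <- tryCatch(
  {
    data.frame(address, instructions)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical helper: replace an empty result with NA so the row still builds.
or_na <- function(x) if (length(x) == 0) NA_character_ else x

row <- data.frame(address, instructions = or_na(instructions),
                  stringsAsFactors = FALSE)
print(row)
```

Filling in NA keeps the row but hides the fact that the page used different markup, so it is a stopgap rather than a fix.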

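Finding the problem was just a matter of running each URL on its own until one produced the error, then inspecting that page's HTML structure.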
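That search can be automated. The sketch below is only an illustration: find_failures is a hypothetical helper, and toy_scrape with its example.com URLs is a stand-in for the real scraper so the snippet runs without network access.

```r
# Hypothetical helper: call f on each input separately and
# return the inputs for which f raised an error.
find_failures <- function(inputs, f) {
  failed <- vapply(inputs, function(x) {
    tryCatch({ f(x); FALSE }, error = function(e) TRUE)
  }, logical(1))
  inputs[failed]
}

# Toy stand-in for the scraper: fails on URLs containing "bad".
toy_scrape <- function(u) {
  if (grepl("bad", u)) stop("arguments imply differing number of rows: 1, 14, 0")
  data.frame(address = u)
}

urls <- c("http://example.com/ok-1",
          "http://example.com/bad-2",
          "http://example.com/ok-3")

failing <- find_failures(urls, toy_scrape)
print(failing)
```

With the real function you would pass f4fscrape in place of toy_scrape and then inspect the HTML of whatever URLs come back.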

Here is the function I originally wrote:

f4fscrape <- function(url) {

#Create an empty dataframe

    df <- data.frame(matrix(ncol = 11, nrow = 0))
    colnames <- c('address', 'recipename', 'prept', 'cookt',
                  'calories', 'protein', 'carbs', 'fat',
                  'servings', 'ingredients', 'instructions')
    colnames(df) <- paste(colnames)

    #check for the recipe url in dataframe already,
    #only carry on if not present

    for (i in length(url)) 
            if (url[i] %in% df$url) { next }
    else {

    #parse url as html

    doc2 <-htmlTreeParse(url[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- url[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, list(ingredients), instructions)

    #rename the tricky column

    colnames(result)[10] <- 'ingredients'

    #bind data to existing df

    df <- rbind(df, result)
            }

    #return df

    df
}

Here is the solution. I just added a fallback condition, as follows:

instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
            if (length(instructions) == 0) {
                    instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)}
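If more markup variants turn up, the same idea can be generalised. This is a rough sketch, assuming the XML package is installed; first_match is a hypothetical helper and the HTML string is made up for the demonstration.

```r
library(XML)

# Hypothetical helper: try each XPath in turn and return the first
# non-empty result; return NA if none of them match.
first_match <- function(node, xpaths) {
  for (xp in xpaths) {
    hit <- xpathSApply(node, xp, xmlValue)
    if (length(hit) > 0) return(hit)
  }
  NA_character_
}

# Made-up page that only has the second kind of instructions markup.
html <- paste0("<html><body>",
               "<ul class='b-list m-circle instrs'><li>Mix.</li><li>Bake.</li></ul>",
               "</body></html>")
doc <- htmlParse(html, asText = TRUE)

instructions <- first_match(doc, c("//ol[@class='methodOL']",
                                   "//ul[@class='b-list m-circle instrs']"))
print(instructions)

# No XPath matches here, so the helper falls through to NA.
missing <- first_match(doc, "//ol[@class='does-not-exist']")
```

Adding another variant is then just one more XPath in the vector instead of another nested if.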

Answer 1

I was able to tweak your function so that it works:

library(XML)  # provides htmlTreeParse, xmlRoot, xpathSApply, xmlValue

f4fscrape <- function(urls) {

  #Create an empty dataframe

  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')

  names(df) <- cnames

  #check for the recipe url in dataframe already,
  #only carry on if not present

  for (i in 1:length(urls)) 
    if (urls[i] %in% df$address) {
      next }
  else {
    #parse url as html

    doc2 <-htmlTreeParse(urls[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- urls[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, paste0(ingredients, collapse = ", "), instructions, stringsAsFactors = FALSE)


    df <- rbind(df, setNames(result, names(df)))
  }

  #return df

  df
}

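Changes:

1) url is a built-in function, so I renamed the argument to urls. The same goes for colnames.

2) I changed the way the column names are assigned.

3) The loop for (i in length(url)) jumps straight to the last index. I changed it to for (i in 1:length(urls)).

4) The condition if (url[i] %in% df$url) referenced a column (url) that does not exist. I changed it to address.

5) The most important change: I used paste0 to join the ingredients into a single string. With what you had, in the one-URL case each ingredient was put on its own row and the other columns were simply repeated (by the recycling rule). Run your current code with a single URL and View() the result. It is probably not what you intended, which is why "works when passed a single URL" belongs in quotation marks.

6) With all these long strings, setting stringsAsFactors = FALSE seemed like a good idea.

7) When you rbind a new row, its names need to match those of the data frame. See this question.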
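Points 3 and 5 are easy to check in isolation. The snippet below is a small base-R sketch with placeholder values. It uses seq_along(urls), which visits the same indices as 1:length(urls) for a non-empty vector and also behaves sensibly when the vector is empty.

```r
urls <- c("a", "b", "c")  # placeholder values, not real URLs

# Point 3: length(urls) is the single number 3, so the loop body runs once.
visited_wrong <- c()
for (i in length(urls)) visited_wrong <- c(visited_wrong, i)

# seq_along(urls) is 1, 2, 3, so every element is visited.
visited_right <- c()
for (i in seq_along(urls)) visited_right <- c(visited_right, i)

# Point 5: a length-3 vector next to a length-1 value recycles the
# single value, producing three rows for one recipe.
ingredients <- c("flour", "sugar", "eggs")
recycled <- data.frame(recipename = "Cake", ingredients,
                       stringsAsFactors = FALSE)

# Collapsing the ingredients into one string gives a single row.
collapsed <- data.frame(recipename = "Cake",
                        ingredients = paste0(ingredients, collapse = ", "),
                        stringsAsFactors = FALSE)

print(visited_wrong)
print(visited_right)
print(nrow(recycled))
print(collapsed)
```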

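When you View() the result of running the adjusted function on the given list, you see something like the following (though of course not shrunk down):

I don't know the XML library well enough to help you improve the speed. Sometimes it runs slowly and sometimes quickly, so it probably has mostly to do with connection speed and is largely out of your control.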

Function throws an error when applied to a vector but not to a single element
