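I have rewritten the following to clarify the problem and the solution, leaving the function and the solution at the bottom as examples. Thanks again to John Coleman for the help!
Problem: I wrote a data-scraping function that works when passed a single URL but not when passed a vector of URLs, in which case it throws this error: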
Error in data.frame(address, recipename, prept, cookt, calories, protein, :
arguments imply differing number of rows: 1, 14, 0
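It turned out that some of the URLs I was trying to scrape use different markup for their instructions section. For those pages, the xpathSApply call that scrapes the instructions returns a list of length 0, which produces the error above when the scraped values are combined into a data frame for rbind.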
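The error can be reproduced without any scraping. The following is a minimal sketch with made-up values in place of scraped ones; character(0) stands in for the empty result of the failed lookup. It suggests that a single zero-length argument is enough to make data.frame() fail, while lengths 1 and 3 on their own are fine:

```r
# Made-up stand-ins for scraped values (not real recipe data).
address      <- "http://example.com/recipe"   # length 1
ingredients  <- c("flour", "eggs", "milk")    # length 3
instructions <- character(0)                  # length 0: nothing was matched

# Lengths 1 and 3 are compatible because the shorter value is recycled,
# but a zero-length argument cannot be recycled to 3 rows.
res <- tryCatch(
  data.frame(address, ingredients, instructions),
  error = function(e) conditionMessage(e)
)

# On failure, res holds the error message instead of a data frame.
print(res)
```

Checking length() on each scraped value before building the data frame is one way to spot which field came back empty.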
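Finding the problem was just a matter of running each URL in turn until one produced the error, then inspecting that page's HTML structure.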
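That search can be automated. Below is a rough sketch, assuming the scraper is a function of one URL; find_failures and toy_scraper are hypothetical names introduced here for illustration, and the toy function only simulates a failure so the example runs without a network connection:

```r
# Apply `f` to each input separately and return the inputs that raise an error.
# `f` is any function of one argument, e.g. the scraper applied to one URL.
find_failures <- function(inputs, f) {
  failed <- vapply(inputs, function(x) {
    tryCatch({ f(x); FALSE }, error = function(e) TRUE)
  }, logical(1))
  inputs[failed]
}

# Toy stand-in for the scraper: fails on inputs containing "bad".
toy_scraper <- function(u) {
  if (grepl("bad", u)) stop("simulated scrape failure")
  u
}

bad <- find_failures(c("page-1", "bad-page", "page-3"), toy_scraper)
print(bad)
```

With the real scraper, the returned URLs would be the pages whose HTML is worth inspecting by hand.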
Here is the function I originally wrote:
f4fscrape <- function(url) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  colnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')
  colnames(df) <- paste(colnames)
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in length(url))
    if (url[i] %in% df$url) { next }
    else {
      # parse url as html
      doc2 <- htmlTreeParse(url[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- url[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a data.frame of the url and relevant data.
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, list(ingredients), instructions)
      # rename the tricky column
      colnames(result)[10] <- 'ingredients'
      # bind data to existing df
      df <- rbind(df, result)
    }
  # return df
  df
}
Here is the solution: I simply added a condition, as follows:
instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
if (length(instructions) == 0) {
  instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)
}
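If more markup variants turn up, the same idea can be generalized. The sketch below is only an illustration: xpath_first_match is a hypothetical helper, not part of the XML package, and the HTML string is invented so the example runs offline. It tries each XPath expression in turn and falls back to NA so that a later data.frame() call still receives a length-1 value:

```r
library(XML)

# Hypothetical helper: try each XPath expression in turn and return the
# first non-empty result, or NA if none of them matches anything.
xpath_first_match <- function(doc, paths) {
  for (p in paths) {
    res <- xpathSApply(doc, p, xmlValue)
    if (length(res) > 0) return(res)
  }
  NA_character_
}

# Invented page that only uses the second markup variant.
html <- paste0("<html><body><ul class='b-list m-circle instrs'>",
               "<li>Mix.</li><li>Bake.</li></ul></body></html>")
doc <- htmlParse(html, asText = TRUE)

instructions <- xpath_first_match(doc, c(
  "//ol[@class='methodOL']",
  "//ul[@class='b-list m-circle instrs']"
))
print(instructions)
```

Returning NA rather than an empty list is a design choice made for this sketch: it keeps the row in the result and makes the missing field visible instead of stopping the whole run.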
Answer 0 (score: 0)
I was able to tweak your function so that it works:
# htmlTreeParse, xmlRoot, xpathSApply and xmlValue come from the XML package
library(XML)

f4fscrape <- function(urls) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
              'calories', 'protein', 'carbs', 'fat',
              'servings', 'ingredients', 'instructions')
  names(df) <- cnames
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in 1:length(urls)) {
    if (urls[i] %in% df$address) {
      next
    } else {
      # parse url as html
      doc2 <- htmlTreeParse(urls[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- urls[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a one-row data.frame of the url and relevant data;
      # the ingredients are collapsed into a single string
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, paste0(ingredients, collapse = ", "),
                           instructions, stringsAsFactors = FALSE)
      # give the new row the same names as df before binding
      df <- rbind(df, setNames(result, names(df)))
    }
  }
  # return df
  df
}
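Changes:
1) url is a built-in function, so I renamed the argument to urls; the same goes for colnames, which I renamed to cnames.
2) I changed the way the column names are assigned.
3) The loop for (i in length(url)) skips straight to the last index. I changed it to for (i in 1:length(urls)).
4) The condition if (url[i] %in% df$url) refers to a column that does not exist (url). I changed it to address.
5) The most important change: I used paste0 to join the ingredients into a single string. With what you had, in the one-URL case each ingredient was put on its own row, and the other columns were simply repeated (by the recycling rule). Run your current code with a single URL and View() the result: it is probably not what you want, hence the scare quotes around "works when passed a single URL".
6) With all of these long strings, setting stringsAsFactors = FALSE seems like a good idea.
7) When you rbind a new row, its names need to match those of the data frame. See this question.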
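Points 3 and 5 can be checked in isolation. The following is a small sketch with placeholder values (the URLs and ingredients are made up); seq_along is shown as a common alternative to 1:length() and is not what the function above uses:

```r
urls <- c("u1", "u2", "u3")   # placeholder values, not real URLs

# length(urls) is the single number 3, so this loop runs once, with i = 3.
visited_wrong <- c()
for (i in length(urls)) visited_wrong <- c(visited_wrong, i)

# seq_along(urls) is 1, 2, 3; unlike 1:length(urls), it is empty for an
# empty vector, so the loop body is then skipped entirely.
visited_right <- c()
for (i in seq_along(urls)) visited_right <- c(visited_right, i)

ingredients <- c("flour", "eggs", "milk")   # made-up ingredients

# Without collapsing: one row per ingredient, and the address is recycled.
long <- data.frame(address = "u1", ingredients, stringsAsFactors = FALSE)

# With collapsing: a single row holding all ingredients in one string.
wide <- data.frame(address = "u1",
                   ingredients = paste0(ingredients, collapse = ", "),
                   stringsAsFactors = FALSE)

print(nrow(long))
print(nrow(wide))
```

The one-row-per-recipe shape is what lets rbind build a table with exactly one line per URL.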
When you View the result of running the adjusted function on the given list of URLs, you see the following (though of course not shrunk down):
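I don't know the XML library well enough to help you improve the speed. Sometimes it runs slowly and sometimes quickly, so the speed probably depends mainly on the connection and is largely out of your control.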