R - Bing search loop does not run to completion - XML error message

Time: 2018-12-04 10:10:53

Tags: r xml for-loop xpath xml-parsing

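I have been using a for loop to pull Bing search results that look up a number of celebrity names ("all$name") in combination with different fruits ("keywords"). The loop itself runs, but after a while it gets stuck and I get the following error message: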

Error: 
1: XML declaration allowed only at the start of the document
2: Unescaped '<' not allowed in attributes values
3: xmlns:content: 'http ...' is not a valid URI
4: attributes construct error
5: Couldn't find end of Start Tag html line 2

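It may be worth mentioning that my actual list of names ("all$name") has 195 rows and my actual "keywords" list has 101 entries (so roughly 20,000 combinations). For confidentiality reasons I cannot show them here.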
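For scale, the product of the two list sizes can be checked directly. The numbers are the ones quoted above; the variable names are only illustrative.

```r
n_names    <- 195  # rows in the actual all$name
n_keywords <- 101  # entries in the actual keywords list

# One request per (name, keyword) pair.
n_requests <- n_names * n_keywords
n_requests
# [1] 19695
```

With a loop over keywords that requests every name at once, this would mean 101 batches of 195 concurrent requests each.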

When troubleshooting, the returned data frame looks fine, so I think the problem is in the XML parsing, as the error message also indicates.
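One way to narrow this down might be to parse each response inside tryCatch, so that a single malformed response yields NA instead of aborting the whole run. The sketch below is only an illustration: `count_items_safely` is a hypothetical helper, and the two strings are made-up stand-ins (a tiny RSS document and a deliberately malformed HTML fragment), not real Bing output.

```r
library(XML)

# Hypothetical helper: count <title> nodes under <item>, returning NA
# when the body cannot be parsed as well-formed XML.
count_items_safely <- function(body) {
  tryCatch({
    doc <- xmlParse(body, asText = TRUE)
    length(xpathSApply(doc, "//item//title", xmlValue))
  }, error = function(e) NA_integer_)
}

# Made-up inputs for illustration only.
good <- paste0('<?xml version="1.0"?><rss><channel>',
               '<item><title>a</title></item>',
               '<item><title>b</title></item>',
               '</channel></rss>')
bad  <- '<html><head><title>Not a feed</head></html>'  # mismatched tags

counts <- vapply(c(good = good, bad = bad), count_items_safely, integer(1))
counts
```

Entries that come back as NA mark the responses worth inspecting by hand, for example by printing the first few hundred characters of the raw body.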

Here is the code I am using:

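library(RCurl)  # getURIAsynchronous()
library(XML)    # xmlParse(), xpathSApply(), xmlValue()

search_results_matrix <- data.frame()

for (i in seq_along(keywords)) {
  # Build one RSS search URL per name for the current keyword.
  search_links <- sapply(trimws(all$name), function(x)
    paste0(
      URLencode('https://www.bing.com/search?count=100&offset=0&format=rss&safeSearch=Off&q="'),
      URLencode(x, reserved = TRUE),
      URLencode('" AND "'),
      URLencode(keywords[i], reserved = TRUE),
      URLencode('"')))

  # Fetch all URLs for this keyword concurrently.
  responses <- getURIAsynchronous(search_links)

  # Count the <title> nodes under <item> in each response.
  search_results <- sapply(responses, function(x)
    length(xpathSApply(xmlParse(x), "//item//title", xmlValue)))

  # Append this keyword's rows to the running result.
  search_results_matrix <- rbind(
    search_results_matrix,
    data.frame(name = trimws(all$name), keyword = keywords[i],
               search_results, search_link = search_links))
}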
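To make the URL construction easier to check in isolation, the query part for a single name and keyword can be built and encoded on its own. This is a minimal sketch in base R; `build_query` is a hypothetical helper, and the name and keyword are taken from the sample data.

```r
# Hypothetical helper: quoted name AND quoted keyword, before encoding.
build_query <- function(name, keyword) {
  paste0('"', trimws(name), '" AND "', keyword, '"')
}

raw_query <- build_query("Tom Hanks", "Crab apples")

# reserved = TRUE percent-encodes the quotes and spaces.
encoded_query <- URLencode(raw_query, reserved = TRUE)
encoded_query
# [1] "%22Tom%20Hanks%22%20AND%20%22Crab%20apples%22"
```

Encoding the whole query in one call gives the same result as encoding the pieces separately and pasting them together, which makes a single URL easy to compare against what the loop produces.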

Here is the input data needed to reproduce the example:

keywords
## Avocado
## Banana
## Bilberry
## Blackberry
## Blackcurrant
## Blueberry
## Boysenberry
## Crab apples
## Currant
## Cherry

all
##   name
## 1 Julia Roberts
## 2 George Lucas
## 3 Oprah Winfrey
## 4 Tom Hanks
## 5 Michael Jordan
## 6 The Rolling Stones
## 7 Tiger Woods
## 8 Backstreet Boys
## 9 Cher
## 10 Steven Spielberg
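
The listings above can be recreated as R objects roughly as follows. This is a plain transcription of the sample values; `stringsAsFactors = FALSE` is an assumption on my part, so that the names stay character strings.

```r
# Sample keywords, transcribed from the listing above.
keywords <- c("Avocado", "Banana", "Bilberry", "Blackberry", "Blackcurrant",
              "Blueberry", "Boysenberry", "Crab apples", "Currant", "Cherry")

# Sample names, transcribed from the listing above.
# Note: the name `all` masks base::all as a variable, as in the original code.
all <- data.frame(
  name = c("Julia Roberts", "George Lucas", "Oprah Winfrey", "Tom Hanks",
           "Michael Jordan", "The Rolling Stones", "Tiger Woods",
           "Backstreet Boys", "Cher", "Steven Spielberg"),
  stringsAsFactors = FALSE
)
```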

0 answers:

No answers yet