Parallelizing web scraping with R

Date: 2015-02-15 10:19:53

Tags: r asynchronous xpath web-scraping rcurl

I am trying to scrape data from the web using the asynchronous approach mentioned in this post. These are the URLs I want to get data from. I store the URLs in a list.Rdata file. The links can be downloaded from here: https://www.dropbox.com/s/wl2per5npuq5h8y/list.Rdata?dl=1

First I load the first 1000 URLs:

library(RCurl)  
library(rvest)
library(XML)
library(httr)
library(reshape2)
library(reshape)

load("list.Rdata")
list <- list[1:1000]
un <- unlist(list)

Then I use this code to scrape the content from those URLs:

get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  doc <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
  base <- xpathSApply(doc, "//table//tr//td", xmlValue)
  # Pavadinimas (company name, taken from the page title)
  uab <- ifelse(length(xpathSApply(doc, "//head//title", xmlValue)) == 1,
                gsub(". Rekvizitai.lt", "", xpathSApply(doc, "//head//title", xmlValue)), "-")
  # Imones kodas (company code); agrep returns integer(0) on no match, so test the length
  ik <- ifelse(length(agrep("Imones kodas", base)) == 0, "-", base[agrep("Imones kodas", base) + 1])
  # PVM kodas (VAT code)
  pk <- ifelse(is.na(match("PVM kodas", base)), "-", base[match("PVM kodas", base) + 1])
  # Vadovas (manager)
  vad <- ifelse(is.na(match("Vadovas", base)), "-", base[match("Vadovas", base) + 1])
  # Adresas (address)
  ad <- ifelse(is.na(match("Adresas", base)), "-", base[match("Adresas", base) + 1])
  # Telefonas (phone; rendered as an image, so store the image URL)
  tel <- ifelse(is.na(match("Telefonas", base)), "-",
                paste("http://rekvizitai.vz.lt", xpathSApply(doc, "//table//tr//td//@src")[1], sep = ""))
  # Mobilusis (mobile phone; second number image)
  mob <- ifelse(is.na(match("Mobilusis", base)), "-",
                paste("http://rekvizitai.vz.lt", xpathSApply(doc, "//table//tr//td//@src")[2], sep = ""))
  # Tinklalapis (website)
  url <- ifelse(is.na(match("Tinklalapis", base)), "-", gsub("\t", "", base[match("Tinklalapis", base) + 1]))
  # Skype
  sk <- ifelse(is.na(match("Skype", base)), "-", base[match("Skype", base) + 1])
  # Bankas (bank)
  bnk <- ifelse(is.na(match("Bankas", base)), "-", base[match("Bankas", base) + 1])
  # Atsiskaitomoji saskaita (bank account)
  ats <- ifelse(is.na(match("Atsiskaitomoji saskaita", base)), "-", base[match("Atsiskaitomoji saskaita", base) + 1])
  # Darbo laikas (working hours)
  dl <- ifelse(is.na(match("Darbo laikas", base)), "-", base[match("Darbo laikas", base) + 1])
  # Darbuotojai (number of employees; keep digits only)
  drb <- ifelse(is.na(match("Darbuotojai", base)), "-", gsub("\\D", "", base[match("Darbuotojai", base) + 1]))
  # SD draudejo kodas (social insurance code)
  sd <- ifelse(is.na(match("SD draudejo kodas", base)), "-", base[match("SD draudejo kodas", base) + 1])
  # Apyvarta (be PVM) (turnover excluding VAT)
  apv <- ifelse(is.na(match("Apyvarta (be PVM)", base)), "-", base[match("Apyvarta (be PVM)", base) + 1])
  # Transportas (vehicles)
  trn <- ifelse(is.na(match("Transportas", base)), "-", base[match("Transportas", base) + 1])
  # Ivertinimas (average rating)
  iv <- ifelse(length(xpathSApply(doc, "//span[@class='average']", xmlValue)) != 0,
               xpathSApply(doc, "//span[@class='average']", xmlValue), "-")
  # Vertintoju skaicius (number of ratings)
  vert <- ifelse(length(xpathSApply(doc, "//span[@class='votes']", xmlValue)) != 0,
                 xpathSApply(doc, "//span[@class='votes']", xmlValue), "-")
  # Veiklos sritys (fields of activity; first entry only)
  veikl <- xpathSApply(doc, "//div[@class='floatLeft about']//a | //div[@class='floatLeft about half']//a | //div[@class='about floatLeft']//a",
                       xmlValue)[1]
  # Lentele (assemble one row of the result table)
  df <- cbind(uab, ik, pk, vad, ad, tel, mob, url, sk, bnk, ats, dl, drb, sd, apv, trn, iv, vert, veikl)
  df
}

Next, I use my function to parse the content, and I get an error. I am fairly sure this error is the result of sending many requests to the server.

> system.time(table <- do.call(rbind, lapply(un, get.asynch)))
Error in which(value == defs) : 
  argument "code" is missing, with no default
Timing stopped at: 0.89 0.03 6.82

I am looking for a solution that avoids this kind of behaviour. I tried throttling with the Sys.sleep() function (see the sketch below), but the result is the same. Any help on overcoming the connection problems with the server would be welcome.
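Roughly what the throttled attempt looked like (a sketch; my exact delay may have differed):

get.asynch.slow <- function(url) {
  Sys.sleep(1)  # pause between consecutive requests
  get.asynch(url)
}
table <- do.call(rbind, lapply(un, get.asynch.slow))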

1 answer:

Answer 0 (score: 2):

I searched for a few minutes and found the answer here (second reply): R getURL() returning empty string

You need to use

txt <- getURIAsynchronous(un, .opts = curlOptions(followlocation = TRUE))
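To see why this helps (my own sketch, not part of the linked answer): without followlocation, redirected pages come back as empty strings, which you can check with something like this, assuming un holds the URLs from the question:

txt_plain  <- getURIAsynchronous(un[1:10])
txt_follow <- getURIAsynchronous(un[1:10], .opts = curlOptions(followlocation = TRUE))
sum(nchar(txt_plain) == 0)   # count of empty responses (redirects not followed)
sum(nchar(txt_follow) == 0)  # should drop to zero or close to it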

There is another problem: you are not actually doing this asynchronously. With lapply(un, get.asynch) you send the URLs to get.asynch one by one. To run in parallel you need something like get.asynch(un), but then you have to rewrite the rest of the code. I would split it into two parts, curling

txts <- getURIAsynchronous(un, .opts=curlOptions(followlocation = TRUE))

and parsing

parse <- function(txt) { 
    doc <- htmlParse(txt,asText=TRUE,encoding = "UTF-8")
    base <- xpathSApply(doc, "//table//tr//td",xmlValue)
    ...
}
table <- do.call(rbind, lapply(txts, parse))
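Putting the two parts together, a minimal end-to-end sketch (my addition, not part of the original answer): it assumes parse() is completed with the extraction logic from the question, and fetches in batches so a single bad page does not kill the whole run. Note that the name parse shadows base::parse, so a name like parse_page() would be safer.

scrape <- function(urls, batch_size = 100) {
  batches <- split(urls, ceiling(seq_along(urls) / batch_size))
  rows <- lapply(batches, function(batch) {
    # fetch one batch of pages concurrently, following redirects
    txts <- getURIAsynchronous(batch, .opts = curlOptions(followlocation = TRUE))
    # parse each page, skipping the ones that fail
    parsed <- lapply(txts, function(txt) tryCatch(parse(txt), error = function(e) NULL))
    do.call(rbind, parsed[!sapply(parsed, is.null)])
  })
  do.call(rbind, rows)
}

table <- scrape(un)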

The curling part worked fine for me, at least for the first 100 links. I did not test the parsing part.