I want to download a log file with R via a download link, but all I get is unevaluated HTML.
Here is what I tried, without any success:
url = "http://statcounter.com/p7447608/csv/download_log_file?ufrom=1323783441&uto=1323860282"
# SSL-certificate:
CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
curlH = getCurlHandle(
  header = FALSE,
  verbose = TRUE,
  netrc = TRUE,
  maxredirs = as.integer(20),
  followlocation = TRUE,
  userpwd = "me:mypassw",
  ssl.verifypeer = TRUE)
setwd(tempdir())
destfile = "log.csv"
x = getBinaryURL(url, curl = curlH,
cainfo = CAINFO)
shell.exec(dir())
Answer 0 (score: 2)
Here are two ways to download the file.
Renaming the downloaded file to log.html and opening it shows that the login did not go through; that is why you are getting the HTML structure instead of the log. You need to add your login credentials to the URL.
You can get the name/value pairs from the HTML source of the login form:
<label for="username2">Username:</label>
<input type="text" id="username2" name="form_user" value="" size="12" maxlength="64" class="large">
<span class="label-overlay">
<label for="password2">Password:</label>
<input type="password" name="form_pass" id="password2" value="" size="12" maxlength="64" class="large">
As you can see, the name/value pair for the username is form_user=USERNAME, and for the password it is form_pass=PASSWORD.
That is why curl's userpwd setting does not work: the server expects these form field names, not HTTP authentication, so it does not recognize the credentials you pass that way.
## Url for downloading - Does not contain login credentials.
url <- "http://statcounter.com/p7447608/csv/download_log_file?ufrom=1323783441&uto=1323860282"
USERNAME = 'your username'
PASSWORD = 'your password'
## Url for downloading - Does contain login credentials. Use this one!!
url <- paste( 'http://statcounter.com/p7447608/csv/download_log_file?ufrom=1323783441&uto=1323860282&form_user=', USERNAME, '&form_pass=', PASSWORD, sep = '')
## method one, using download file
download.file(url, destfile = "log.csv" )
csv.data <- read.csv("log.csv" )
head(csv.data)
## method 2 using curl
CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
cookie = 'cookiefile.txt'
curlH = getCurlHandle(
  cookiefile = cookie,
  useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
  header = FALSE,
  verbose = TRUE,
  netrc = TRUE,
  maxredirs = as.integer(20),
  followlocation = TRUE,
  ssl.verifypeer = TRUE)
destfile = "log2.csv"
content = getBinaryURL(url, curl = curlH, cainfo = CAINFO)
## write to file
writeBin(content, destfile)
## read from binary object
csv.data2 <- read.csv(textConnection(rawToChar(content)))
head(csv.data2)
csv.data2 == csv.data
Answer 1 (score: 1)
You don't seem to need the SSL certificate handling etc., since the URL is http:, not https: ... so in this case download.file(url, "log.csv") might just work?
I would first make sure the URL and its response are correct outside of R.
...I visited the URL with Chrome and got a downloaded file, "StatCounter-Log-7447608.csv". It contains a CSV header row followed by HTML?!
"Date and Time","IP Address","IP Address Label","Browser","Version","OS","Resolution","Country","Region","City","Postal Code","ISP","Returning Count","Page URL","Page Title","Came From","SE Name","SE Host","SE Term"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Author" content="StatCounter">
...