Importing multiple CSV files into R over HTTPS

Time: 2014-12-31 16:48:18

Tags: r rcurl

I am trying to import multiple CSV files into R over HTTPS (from Google Drive Sheets).

Here is how I import a single CSV file with RCurl (this works):

#Load packages
require(RCurl)
require(plyr)

x <- getURL("https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv")
x <- read.csv(textConnection(x), header = TRUE, stringsAsFactors = FALSE, skip=1)

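I then created a data frame called "hashtags" that holds the names and URLs of the 12 CSV files, so that I can import all of them. Here are the first six rows of hashtags: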

> head(hashtags)
name             url
1 #capstoneisfun https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv
2 #CEP810        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv
3 #CEP811        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDhLcEI1a0U1T0I0Zm5RaU5UVWdmdlE&output=csv
4 #CEP812        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDJzMjZhN2pGa29QYU5weVhZdjRKdmc&output=csv
5 #CEP813        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdGpJa0VMTmJNdzZ4UjBvUEx5cWsycEE&output=csv
6 #CEP815        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFB2R0czWjJ2SU9HQWR5VUVuODk3R0E&output=csv

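What I want to do is import all of the files as data frames. I know that an apply function or a for loop could do this, but both are beyond my current ability.

3 answers:

Answer 0 (score: 2)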

This is a good place to use the curl package, which provides curl(), a drop-in replacement for url() that works with https:

library(curl)

urls <- c(
  "https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv",
  "https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv"
)

# Open one curl connection per URL, then read each connection as CSV
cons <- lapply(urls, curl)
lapply(cons, read.csv, stringsAsFactors = FALSE, skip = 1)
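A possible extension of the curl() approach, not part of the original answer: wrap the download in a small function so the result is a list named by hashtag. The function name read_sheets and its open_con argument are hypothetical; open_con defaults to curl::curl and exists only so the function can be tried on local files without network access. This is a sketch under those assumptions.

```r
# Hypothetical helper: read every URL into a list of data frames,
# named by `sheet_names`. `open_con` is an assumed injection point:
# any function that turns a location into a connection (curl::curl by default).
read_sheets <- function(urls, sheet_names, open_con = curl::curl) {
  out <- lapply(urls, function(u) {
    # read.csv() opens the connection, reads it, and closes it again
    read.csv(open_con(u), stringsAsFactors = FALSE, skip = 1)
  })
  setNames(out, sheet_names)
}

# Usage with the question's data frame (needs the curl package and network access):
# sheets <- read_sheets(hashtags$url, hashtags$name)
# sheets[["#CEP810"]]
```

Naming the list up front means each sheet can later be looked up by its hashtag instead of by position.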

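Answer 1 (score: 2)

Here is one that uses httr (which improves on RCurl and gives you a better time on Windows) and data.table's rbindlist, so that you get a single table containing all the tweets and their hashtags instead of having to work through a list. dplyr is used only because it is what I use every day now; the %>% calls can easily be removed and replaced with base operations.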

library(httr)
library(dplyr)

hashtags <- read.table(text="hashtag,url
#capstoneisfun,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv
#CEP810,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv
#CEP811,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDhLcEI1a0U1T0I0Zm5RaU5UVWdmdlE&output=csv
#CEP812,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDJzMjZhN2pGa29QYU5weVhZdjRKdmc&output=csv
#CEP813,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdGpJa0VMTmJNdzZ4UjBvUEx5cWsycEE&output=csv
#CEP815,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFB2R0czWjJ2SU9HQWR5VUVuODk3R0E&output=csv", 
                       stringsAsFactors=FALSE, header=TRUE, sep=",", comment.char="")

tweets <- data.table::rbindlist(by(hashtags, hashtags$hashtag, function(x) {
  doc <- GET(x$url)
  dat <- read.csv(textConnection(content(doc, as="text")), header=TRUE, stringsAsFactors=FALSE, sep=",", skip=1)
  dat <- dat %>% mutate(hashtag=x$hashtag)
  dat  
}))

nrow(tweets)
## [1] 1618

glimpse(tweets)

## Variables:
## $ Date         (chr) "12/12/2014 21:51:49", "11/19/2014 10:17:39", "11/16/2014 4:2...
## $ Twitter.User (chr) "https://twitter.com/matthewkoehler/status/543440594446868481...
## $ Followers    (int) 946, 895, 399, 12, 153, 881, 216, 865, 395, 12, 82, 857, 393,...
## $ Follows      (int) 994, 907, 1174, 24, 114, 887, 492, 869, 1148, 24, 201, 855, 1...
## $ Retweets     (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Favorites    (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ Tweet.Text   (chr) "#capstoneisfun Awesome TA of the Week is @spgreenhalgh ! htt...
## $ hashtag      (chr) "#capstoneisfun", "#capstoneisfun", "#capstoneisfun", "#capst...

tweets$hashtag %>% unique

## [1] "#capstoneisfun" "#CEP810"        "#CEP811"        "#CEP812"       
## [5] "#CEP813"        "#CEP815"       
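The answer notes that the dplyr and data.table pieces can be swapped for base operations. A minimal base R sketch of the parse, tag, and combine steps might look like the following. The helper name parse_sheet and the two tiny CSV bodies are made up for illustration; in real use each body would be the text returned by content(GET(url), as = "text").

```r
# Hypothetical helper: parse one CSV body and tag every row with its hashtag.
parse_sheet <- function(csv_text, hashtag) {
  # skip = 1 drops the first line, as in the answer's read.csv() call
  dat <- read.csv(text = csv_text, header = TRUE,
                  stringsAsFactors = FALSE, skip = 1)
  dat$hashtag <- rep(hashtag, nrow(dat))   # base replacement for mutate()
  dat
}

# Made-up stand-in data (assumption): two small CSV bodies keyed by hashtag.
bodies <- c(
  "#a" = "title line\nDate,Retweets\n2014-12-01,0\n2014-12-02,3\n",
  "#b" = "title line\nDate,Retweets\n2014-12-03,1\n"
)

# Parse each body, then stack the pieces (base replacement for rbindlist()).
parts  <- Map(parse_sheet, bodies, names(bodies))
tweets <- do.call(rbind, parts)
rownames(tweets) <- NULL   # drop the "#a.1"-style row names rbind() creates
```

do.call(rbind, ...) is fine for a dozen small sheets; for many large ones, rbindlist as used in the answer is likely to be faster.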

Answer 2 (score: -1)

Maybe:

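library(RCurl)  # provides getURL()

dfList <- list()
for (i in seq_len(nrow(hashtags))) {
  # Download the CSV text for row i; as.character() guards against factor columns
  x <- getURL(as.character(hashtags[i, "url"]))
  # Store the parsed data frame under the hashtag name (first column)
  dfList[[as.character(hashtags[i, 1])]] <- read.csv(textConnection(x), header = TRUE,
                                                     stringsAsFactors = FALSE, skip = 1)
}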

This appears to work (although I don't think loading plyr is necessary, and the code was tested without doing so). The top of the str(dfList) output:

str(dfList)
List of 6
 $ #capstoneisfun:'data.frame': 63 obs. of  7 variables:
  ..$ Date        : chr [1:63] "12/12/2014 21:51:49" "11/19/2014 10:17:39" "11/16/2014 4:29:39" "11/14/2014 5:44:57" ...
  ..$ Twitter.User: chr [1:63] "https://twitter.com/matthewkoehler/status/543440594446868481" "https://twitter.com/matthewkoehler/status/534930982802321408" "https://twitter.com/spgreenhalgh/status/533756240837771265" "https://twitter.com/sarahfkeenan/status/533050416087715840" ...
  ..$ Followers   : int [1:63] 946 895 399 12 153 881 216 865 395 12 ...
  ..$ Follows     : int [1:63] 994 907 1174 24 114 887 492 869 1148 24 ...
  ..$ Retweets    : int [1:63] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ Favorites   : int [1:63] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ Tweet.Text  : chr [1:63] "#capstoneisfun Awesome TA of the Week is @spgreenhalgh ! http://t.co/fbKqtHAhcl" "Module 12 is beginning! #capstoneisfun" "Had a fantastic time with #capstoneisfun students today in exhibitions! So fun to see everyone's portfolios as they're finishin"| __truncated__ "@emstrazz, your intended audience can 
 # snipped rest
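Once the loop has filled a named list like dfList, individual sheets can be pulled out by hashtag. The sketch below uses a tiny made-up list in place of the real downloads, so the values are illustrative only.

```r
# Made-up stand-in (assumption) for the dfList built by the loop above.
dfList <- list(
  "#capstoneisfun" = data.frame(Retweets = c(0L, 0L, 3L)),
  "#CEP810"        = data.frame(Retweets = c(1L, 2L))
)

# The hashtags serve as element names.
names(dfList)

# Number of rows in each data frame, as a named integer vector.
row_counts <- vapply(dfList, nrow, integer(1))

# Names starting with '#' are not syntactic, so use [[ ]] (or backticks with $).
cep810 <- dfList[["#CEP810"]]
```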