Question

I am new to R programming. I am using R 3.4.2 on Mac OS X El Capitan 10.11.6.

I get an error when I try to read data from the URL below.

Data source link: https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz

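The file contains four fields: the language, the Wikipedia page title, the number of requests the page received, and the total size of the content returned (in bytes). It is a gzip-compressed, space-delimited text file with no header row.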

I tried to read the table with the following code:

df <- read.table("https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz", sep = " ", stringsAsFactors = FALSE, header = FALSE, encoding = "UTF-8")

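The error I get is:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 1 did not have 2 elements
In addition: Warning message:
In read.table("https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz",  :
  line 1 appears to contain embedded nulls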

I also tried the readr package, but it failed as well. The code I used is below:

df <- read_delim("https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz", delim = " ", col_names = FALSE)

BTW, when I read this data with Spark (Scala), there is no problem.

Answer 1

library(stringi)
library(tidyverse)

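# Assumes the .gz file has already been downloaded to the working directory.
gzfile("pageviews-20171001-010000.gz") %>% 
  readLines(skipNul=TRUE) %>%                # read all lines, skipping embedded nuls
  stri_split_fixed(" ", simplify=TRUE) %>%   # split on single spaces into a character matrix
  as_data_frame() -> xmat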

xmat

## # A tibble: 4,598,475 x 4
##       V1                                V2    V3    V4
##    <chr>                             <chr> <chr> <chr>
##  1    aa                 Category:Articles     1     0
##  2    aa                  Category:User_aa     1     0
##  3    aa        File:Wikipedia-logo-en.png     2     0
##  4    aa                         Main_Page    35     0
##  5    aa               Special:ActiveUsers     6     0
##  6    aa Special:Contributions/Lars~aawiki     1     0
##  7    aa    Special:Contributions/PipepBot     1     0
##  8    aa                 Special:ListFiles     3     0
##  9    aa                 Special:ListUsers     3     0
## 10    aa                Special:Statistics    10     0
## # ... with 4,598,465 more rows
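The same download-then-decompress idea can be sketched in base R alone. This is a minimal sketch, not a tested fix for the asker's machine: `read_pageviews` and the column names are hypothetical, chosen from the field description in the question, and the download step is left commented out because it needs network access. `quote = ""` and `comment.char = ""` are set on the assumption that page titles may contain quote characters or `#`.

```r
# Hypothetical helper: read one pageviews dump from a local .gz file.
# gzfile() decompresses on the fly; read.table() opens and closes the connection.
read_pageviews <- function(path) {
  read.table(
    gzfile(path),
    sep = " ",
    header = FALSE,
    quote = "",            # do not treat ' or " in page titles as quoting
    comment.char = "",     # do not treat # in page titles as a comment
    stringsAsFactors = FALSE,
    encoding = "UTF-8",
    col.names = c("language", "page_title", "requests", "bytes")  # assumed names
  )
}

# Usage (requires network access, so it is not run here):
# url  <- "https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz"
# dest <- file.path(tempdir(), basename(url))
# download.file(url, dest, mode = "wb")   # mode = "wb" keeps the binary file intact
# df   <- read_pageviews(dest)
```

Unlike the `stri_split_fixed()` approach, this lets `read.table()` infer numeric types for the last two columns, at the cost of being slower on a file of this size.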

Answer 2

This works for me. Could it be a system or package version dependency?

library(readr)

df <- read_delim("https://dumps.wikimedia.org/other/pageviews/2017/2017-10/pageviews-20171001-010000.gz", 
                 delim = " ", col_names = FALSE)
df
## # A tibble: 4,421,548 x 4
##       X1                                X2    X3    X4
##    <chr>                             <chr> <int> <int>
##  1    aa                 Category:Articles     1     0
##  2    aa                  Category:User_aa     1     0
##  3    aa        File:Wikipedia-logo-en.png     2     0
##  4    aa                         Main_Page    35     0
##  5    aa               Special:ActiveUsers     6     0
##  6    aa Special:Contributions/Lars~aawiki     1     0
##  7    aa    Special:Contributions/PipepBot     1     0
##  8    aa                 Special:ListFiles     3     0
##  9    aa                 Special:ListUsers     3     0
## 10    aa                Special:Statistics    10     0
## # ... with 4,421,538 more rows
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.10
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
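Since the same call works here but fails for the asker, one way to narrow things down is to download the file and check that it really is gzip data before parsing it. The "embedded nulls" warning in the question is what one would expect if compressed bytes were being parsed as text, though that is an assumption and not something confirmed in this thread. The helper name `is_gzip` below is hypothetical; the check relies only on the fact that gzip files start with the two bytes `1f 8b`.

```r
# Hypothetical helper: TRUE if the file starts with the gzip magic bytes 1f 8b.
# readBin() on a file path reads the raw bytes without decompressing them.
is_gzip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  length(magic) == 2L && identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Usage, assuming the dump was saved locally with download.file(..., mode = "wb"):
# is_gzip("pageviews-20171001-010000.gz")
```

If this returns TRUE for the downloaded file, the data itself is intact and the difference is more likely in how the URL connection is handled on each system.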
