Reading raw data in R

Time: 2016-05-16 22:36:46

Tags: r csv

I am trying to read the Baylor dataset, but because the whitespace is inconsistent I can't use read.csv.

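I do have the column numbers, so I was thinking read.fwf could solve my problem, but that means I would have to go through more than 100 attributes and work out the width of each one.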

Is there an easier way to read the data?

baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=FALSE)

Links: Column Numbers, Baylor Religion 2007 Survey Data

1 answer:

Answer 0: (score: 3)

I haven't tested this carefully, but I think this does it:

Define the URLs:

lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"

Read the file with the column information:

nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)

Extract the starting column of each field:

startcol <- as.numeric( ## convert to numeric
      sapply(
          strsplit(nums[,3],"-"),  ## split strings on dashes
          "[",1))  ## select first element of each result
## sapply(z,"[",1)  == sapply(z,function(x) x[1])
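
To make the extraction step easier to follow, here is a minimal sketch on a toy vector. The range strings below are hypothetical and are not taken from the Baylor column file; they only assume the same "start-end" shape that the strsplit call above relies on.

```r
## Toy column ranges (hypothetical values, not from the Baylor file)
rng <- c("1-10", "11-18", "19-19", "20-21")

## strsplit returns a list with one character vector per range, e.g. c("1", "10")
parts <- strsplit(rng, "-")

## "[" with argument 1 picks the first element of each vector: the start column
startcol_toy <- as.numeric(sapply(parts, "[", 1))
startcol_toy
## [1]  1 11 19 20
```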

The field widths are the differences between consecutive starting columns (assuming the last field has length 1):

w <- c(diff(startcol),1)
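
As a small worked example of the width arithmetic, the sketch below compares the diff approach with computing end - start + 1 directly. It assumes every entry has the form "start-end" and that fields are contiguous; the ranges are hypothetical toy values, and whether the real column file meets those assumptions would need checking.

```r
## Toy column ranges (hypothetical values, not from the Baylor file)
rng <- c("1-10", "11-18", "19-19", "20-21")

## Two-column matrix: start and end of each field
bounds <- do.call(rbind, lapply(strsplit(rng, "-"), as.numeric))
start <- bounds[, 1]
end   <- bounds[, 2]

## Width from differences of start columns; the last width has to be guessed
w_diff <- c(diff(start), 1)

## Width from each range itself; no guess is needed for the last field
w_range <- end - start + 1

w_diff
## [1] 10  8  1  1
w_range
## [1] 10  8  1  2
```

In this toy case the two results differ only in the last field, which is exactly where the "length 1" assumption comes in.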

Read the data as fixed-width:

r <- read.fwf(url(survey_url),widths=w)
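
For anyone who wants to try read.fwf without downloading the survey, here is a minimal sketch on a temporary file. The two records, the widths, and the object name toy are made up for illustration; only the column names echo the survey's first fields.

```r
## Write two hypothetical fixed-width records to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("10010.822312",
             "10020.312446"), tmp)

## Widths 4, 5, 1, 2 cut each line into ID, WEIGHT, REGION, RELIG1
toy <- read.fwf(tmp, widths = c(4, 5, 1, 2),
                col.names = c("ID", "WEIGHT", "REGION", "RELIG1"))
toy
unlink(tmp)  # remove the temporary file
```

Because there is no separator, the widths alone decide where one field ends and the next begins, so a single wrong width shifts every later column.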

Assign the field names:

names(r) <- gsub(":","",nums$COL)

Some quick checks:

str(r[,1:8])
## 'data.frame':    1648 obs. of  8 variables:
##  $ ID      : num  1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
##  $ WEIGHT  : num  0.822 0.312 1.604 1.184 1.35 ...
##  $ REGION  : int  3 3 4 3 2 2 2 4 2 2 ...
##  $ RELIG1  : int  12 12 46 45 14 31 16 33 16 16 ...
##  $ RELIG2  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ DENOM   : Factor w/ 301 levels "                                                   ",..: 231 231 1 1 1 1 83 113 1 23 ...
##  $ RELGIOUS: int  3 4 1 3 3 4 4 4 3 4 ...
##  $ ATTEND  : int  5 8 0 8 3 0 8 7 1 8 ...

tail(sort(levels(r$DENOM)))
## [1] "        RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] "      ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] "      WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] "    THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] "   GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"

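There may be more processing to do (e.g. stripping the whitespace from the denomination names), and I would certainly check these results further, but this should get you most of the way there.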
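
The whitespace cleanup mentioned above could look roughly like this. It is a sketch on a toy vector (the second and third values are hypothetical), using base R's trimws; the as.character call is there because the column may come back as a factor or as character, depending on the R version (the stringsAsFactors default changed in R 4.0.0).

```r
## Toy denomination column with padding and one all-blank entry
denom <- factor(c("      ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA",
                  "   ",
                  "  BAPTIST"))

## Strip leading/trailing whitespace, then treat empty strings as missing
clean <- trimws(as.character(denom))
clean[clean == ""] <- NA
clean <- factor(clean)

levels(clean)
## [1] "BAPTIST"
## [2] "ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
```

On the real data the same three steps would be applied to r$DENOM.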

For future reference, it is probably worth downloading the data from the original download site and checking crosstabs against the code book...
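
A crosstab check of that kind might be sketched as follows. The data frame and its values are hypothetical; the idea is only to tabulate a variable and compare the counts by eye with whatever frequencies the code book lists.

```r
## Toy data standing in for two survey variables (hypothetical values)
toy <- data.frame(REGION = c(3, 3, 4, 2, 2, NA),
                  ATTEND = c(5, 8, 0, 3, 0, 8))

## One-way frequency table, with missing values counted explicitly
region_counts <- table(toy$REGION, useNA = "ifany")
region_counts

## Two-way crosstab of the same variables
xt <- table(REGION = toy$REGION, ATTEND = toy$ATTEND, useNA = "ifany")
xt
```

If a field were read with the wrong width, its counts would typically look nothing like the published ones, which makes this a cheap sanity check.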