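I am trying to read the Baylor data set, but because the spacing is inconsistent I can't use read.csv.
I do have the column numbers, so I was thinking read.fwf would help solve my problem, but that would mean going through more than 100 attributes and working out each field width by hand.
Is there an easier way to read the data?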
baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=FALSE)
Answer 0 (score: 3)
I haven't tested this carefully, but I think this does it:
Define the URLs:
lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"
Read the file with the column information:
nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)
Extract the starting column of each field:
startcol <- as.numeric(                  ## convert to numeric
    sapply(
        strsplit(nums[,3],"-"),          ## split strings on dashes
        "[",1))                          ## select first element of each result
## sapply(z,"[",1) == sapply(z,function(x) x[1])
The field widths are the differences between successive starting columns (assuming the last field has width 1):
w <- c(diff(startcol),1)
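To make the width calculation concrete, here is a small sketch on made-up column ranges. The vector `rng` is hypothetical; it only imitates the "start-end" style that `nums[,3]` appears to use, and I am assuming single-column fields would show up as a bare number. It also shows an alternative that takes the end column from each range, which would avoid having to guess the width of the last field.

```r
## Hypothetical column ranges in the same "start-end" style as nums[,3]
rng <- c("1-10", "11-18", "19", "20-21", "22")

parts    <- strsplit(rng, "-")                     ## split strings on dashes
startcol <- as.numeric(sapply(parts, "[", 1))      ## first piece = start column
## last piece = end column (same as the start for single-column fields)
endcol   <- as.numeric(sapply(parts, function(x) x[length(x)]))

w_diff  <- c(diff(startcol), 1)    ## approach from the answer: differences of starts
w_range <- endcol - startcol + 1   ## alternative: use the end columns directly

print(w_diff)
print(w_range)
```

The two only agree when the fields are contiguous. If the real layout has gaps between fields, the difference-based widths absorb the gap into the preceding field, so it is worth comparing both against the column-number file.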
Read the fixed-width file:
r <- read.fwf(url(survey_url),widths=w)
Assign the field names:
names(r) <- gsub(":","",nums$COL)
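If you want to try the read-and-name step without downloading anything, a minimal sketch on in-memory text looks like this. The three records, the widths, and the layout (ID in columns 1-4, REGION in column 5, ATTEND in column 6) are invented for illustration and are not taken from the Baylor file.

```r
## Made-up fixed-width records: ID = cols 1-4, REGION = col 5, ATTEND = col 6
lines <- c("100135",
           "100248",
           "100320")
w <- c(4, 1, 1)   ## one width per field, in order

## read.fwf accepts a connection, so textConnection() stands in for url(...)
toy <- read.fwf(textConnection(lines), widths = w)

## names arrive with a trailing colon in this sketch, as nums$COL seems to
names(toy) <- gsub(":", "", c("ID:", "REGION:", "ATTEND:"))
str(toy)
```

The same pattern should scale to the real file once `w` holds one width per attribute.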
Some quick checks:
str(r[,1:8])
## 'data.frame': 1648 obs. of 8 variables:
## $ ID : num 1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
## $ WEIGHT : num 0.822 0.312 1.604 1.184 1.35 ...
## $ REGION : int 3 3 4 3 2 2 2 4 2 2 ...
## $ RELIG1 : int 12 12 46 45 14 31 16 33 16 16 ...
## $ RELIG2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ DENOM : Factor w/ 301 levels " ",..: 231 231 1 1 1 1 83 113 1 23 ...
## $ RELGIOUS: int 3 4 1 3 3 4 4 4 3 4 ...
## $ ATTEND : int 5 8 0 8 3 0 8 7 1 8 ...
tail(sort(levels(r$DENOM)))
## [1] " RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] " ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] " WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] " THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] " GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"
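More processing may be needed (e.g. stripping the whitespace from the denomination field), and I would certainly check these results further, but this should get you most of the way there.
For future reference, it might be worth downloading the data from the original download site and checking the cross-tabulations against the code book...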
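As a rough sketch of that whitespace cleanup, assuming the column came in as a padded factor like `r$DENOM` above (the `denom` values here are made up):

```r
## Made-up stand-in for r$DENOM: a factor with space padding and a blank entry
denom <- factor(c("   ", "  BAPTIST", " ELCA  ", "  BAPTIST"))

clean <- trimws(as.character(denom))   ## drop leading/trailing whitespace
clean[clean == ""] <- NA               ## treat all-blank entries as missing
clean <- factor(clean)                 ## rebuild the factor from cleaned strings

levels(clean)
```

Passing `strip.white = TRUE` through `read.fwf` (which hands extra arguments on to `read.table`) may do the trimming at read time instead, but I have not checked that against this file.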
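One way to do that kind of check is to tabulate a variable and compare the counts with the frequencies printed in the code book. A minimal sketch, using the first ten REGION values from the `str()` output above as stand-in data (with the real data you would tabulate `r$REGION` instead):

```r
## First ten REGION values shown in the str() output, used here as sample data
region <- c(3, 3, 4, 3, 2, 2, 2, 4, 2, 2)

## useNA = "ifany" also shows an NA count, which helps spot misaligned fields
counts <- table(region, useNA = "ifany")
print(counts)
```

If the widths are off by even one column, the tabulated categories usually stop matching the code book, so this tends to catch alignment errors quickly.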