Importing HCUP KID into R

Time: 2016-11-20 00:24:33

Tags: r import bigdata

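I'm new to all of this: programming, statistics, and R. I'm trying to load a large dataset into R. It's in ASC format. I've spent many hours trying everything from read.table to rgdal to read.asc, without success. The file is 1.5 GB, so I can't open it in a text editor. I have a mentor who says it needs to be read line by line. The plan was to read the first 50 records to see whether it works, but it doesn't: I just get a few empty columns. Is there an obvious problem here? I've checked all the column names and character positions, and that the working directory and file name are correct.

Here is a link to the record layout, so you can see why it's done this way: http://www.hcup-us.ahrq.gov/db/nation/kid/tools/stats/FileSpecifications_KID_2012_Core.TXT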

input = file("KID2012Core.asc","r")
numRows = 50;

df = data.frame(row=seq(1,numRows),
HOSP_KID = NA,      
RECNUM = NA,           
AGE = NA,              
AGE_NEONATE = NA,  
AMONTH = NA, 
AWEEKEND = NA,      
DIED = NA,             
DISCWT = NA,            
DISPUNIFORM = NA,   
DQTR = NA,      
DRG = NA,              
DRG24 = NA,             
DRGVER = NA,         
DRG_NoPOA = NA,   
DX1 = NA,          
DX2 = NA,                
DX3 = NA,                
DX4 = NA,                
DX5 = NA,                
DX6 = NA,                
DX7 = NA,                
DX8 = NA,                
DX9 = NA,                
DX10 = NA,               
DX11 = NA,              
DX12 = NA,              
DX13 = NA,              
DX14 = NA,   
DX15 = NA,   
DX16 = NA,   
DX17 = NA,   
DX18 = NA,   
DX19 = NA,   
DX20 = NA,   
DX21 = NA,   
DX22 = NA,   
DX23 = NA,   
DX24 = NA,   
DX25 = NA,   
DXCCS1 = NA,
DXCCS2 = NA,
DXCCS3 = NA,
DXCCS4 = NA,

and so on for all 142 columns

for(i in seq(1,numRows)) {
    line = readLines(input,n=1)


df$HOSP_KID[i] = substr(input, 1, 5) 
df$RECNUM[i] = substr(input, 6, 13)
df$AGE[i] = substr(input, 14, 16)            
df$AGE_NEONATE[i] = substr(input, 17, 18)              
df$AMONTH[i] = substr(input, 19, 20) 
df$AWEEKEND[i] = substr(input, 21, 22 )      
df$DIED[i] = substr(input, 23, 24)             
df$DISCWT[i] = substr(input, 25, 35)           
df$DISPUNIFORM[i] = substr(input, 36, 37)  
df$DQTR[i] = substr(input, 38, 39)      
df$DRG[i] = substr(input, 40, 42)           
df$DRG24[i] = substr(input, 43, 45)
df$DRGVER[i] = substr(input, 46, 47)        
df$DRG_NoPOA[i] = substr(input, 48, 50)  
df$DX1[i] = substr(input, 51, 55)          
df$DX2[i] = substr(input, 56, 60)                
df$DX3[i] = substr(input, 61, 65)                 
df$DX4[i] = substr(input, 66, 70)                

and so on }

Thanks in advance!

1 Answer:

Answer 0: (score: 0)

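You're working with a fixed-width file, which takes a bit of work to read. There's a base function called read.fwf, but it requires you to work out the width of each column and of any gaps between columns, which can get tedious.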
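To make the width bookkeeping concrete, here is a minimal sketch of the base-R route. It assumes only the HOSP_KID (positions 1-5) and AGE (positions 14-16) fields from the layout, skips RECNUM (positions 6-13) with a negative width, and reads a small temporary file of made-up records rather than the real KID2012Core.asc.

```r
# Made-up records laid out like the first fields of the KID core file
# (HOSP_KID 1-5, RECNUM 6-13, AGE 14-16); these are not real HCUP data.
toy_file <- tempfile(fileext = ".asc")
writeLines(c(paste0("10001", "00000001", "  5"),
             paste0("10002", "00000002", " 12")),
           toy_file)

# Each width is end - start + 1. A negative width tells read.fwf to skip
# that many characters, here the 8-character RECNUM field.
widths <- c(5, -8, 3)

toy_base <- read.fwf(toy_file,
                     widths = widths,
                     col.names = c("HOSP_KID", "AGE"),
                     n = 50)   # read at most the first 50 records

print(toy_base)
```

For the real file, the widths vector would need one entry per column (or per skipped gap) in the layout, which is the tedious part mentioned above.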

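The readr package offers several ways to set the column specification in its alternative, read_fwf. In this case, the layout file linked in the question works nicely with its fwf_positions helper function: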

library(readr)

df <- read_fwf('http://www.hcup-us.ahrq.gov/db/nation/kid/tools/stats/FileSpecifications_KID_2012_Core.TXT', 
               fwf_positions(start = c(1, 5, 10, 27, 31, 61, 65, 69, 71, 76), 
                             end = c(3, 8, 25, 29, 59, 63, 67, 69, 74, NA)),    # use NA here for ragged column
               skip = 20)

df
## # A tibble: 142 × 10
##       X1    X2    X3    X4          X5    X6    X7    X8    X9
##    <chr> <int> <chr> <int>       <chr> <int> <int> <int> <chr>
## 1    KID  2012  Core     1    HOSP_KID     1     5    NA   Num
## 2    KID  2012  Core     2      RECNUM     6    13    NA   Num
## 3    KID  2012  Core     3         AGE    14    16    NA   Num
## 4    KID  2012  Core     4 AGE_NEONATE    17    18    NA   Num
## 5    KID  2012  Core     5      AMONTH    19    20    NA   Num
## 6    KID  2012  Core     6    AWEEKEND    21    22    NA   Num
## 7    KID  2012  Core     7        DIED    23    24    NA   Num
## 8    KID  2012  Core     8      DISCWT    25    35     7   Num
## 9    KID  2012  Core     9 DISPUNIFORM    36    37    NA   Num
## 10   KID  2012  Core    10        DQTR    38    39    NA   Num
## # ... with 132 more rows, and 1 more variables: X10 <chr>

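This file is actually the codebook for another file, but you can use the positions in X6 and X7 to read the data file in the same way; probably something like

df2 <- read_fwf('a_big_fwf.txt',
                fwf_positions(start = df$X6, end = df$X7, col_names = df$X5),
                n_max = 50)    # read first 50 lines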
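As a quick way to check the idea before pointing it at the 1.5 GB file, the same call can be tried on a toy example. This is only a sketch: the `codebook` data frame below is a hand-typed stand-in for the first four rows of the parsed layout, and the records are made up.

```r
library(readr)

# Hand-typed stand-in for the first rows of the parsed codebook:
# X5 = variable name, X6 = start position, X7 = end position.
codebook <- data.frame(
  X5 = c("HOSP_KID", "RECNUM", "AGE", "AGE_NEONATE"),
  X6 = c(1L, 6L, 14L, 17L),
  X7 = c(5L, 13L, 16L, 18L),
  stringsAsFactors = FALSE
)

# Made-up records that follow those positions (not real HCUP data).
toy_file <- tempfile(fileext = ".asc")
writeLines(c(paste0("10001", "00000001", "  5", " 0"),
             paste0("10002", "00000002", " 12", " 1")),
           toy_file)

# Column positions and names come straight from the codebook.
toy <- read_fwf(toy_file,
                fwf_positions(start = codebook$X6,
                              end = codebook$X7,
                              col_names = codebook$X5),
                n_max = 50)   # read at most the first 50 lines

print(toy)
```

If the column names and values line up on the toy file, the same pattern with the full codebook should apply to the real data file.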

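If you plan to read the whole file, you can also specify col_types; the information is in X9, but you'll need to convert it to a format readr understands. For details, see vignette('column-types', package = 'readr')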
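One possible conversion, sketched under assumptions: the codebook's type column is taken to contain only the labels "Num" and "Char", which are mapped to readr's compact type codes "d" (double) and "c" (character). The HOSP_KID and AGE rows match the printed codebook above; the "Char" type for DX1 is an assumption for illustration, and the record is made up.

```r
library(readr)

# Codebook excerpt with names, positions and types.
# The "Char" type for DX1 is assumed, not taken from the layout file.
codebook <- data.frame(
  X5 = c("HOSP_KID", "AGE", "DX1"),
  X6 = c(1L, 14L, 51L),
  X7 = c(5L, 16L, 55L),
  X9 = c("Num", "Num", "Char"),
  stringsAsFactors = FALSE
)

# Translate the codebook labels into readr's compact column codes:
# "d" = double, "c" = character.
type_codes <- ifelse(codebook$X9 == "Num", "d", "c")
col_spec <- paste(type_codes, collapse = "")

# One made-up 55-character record (not real HCUP data).
record <- strrep(" ", 55)
substr(record, 1, 5) <- "10001"
substr(record, 14, 16) <- " 12"
substr(record, 51, 55) <- "V3000"
toy_file <- tempfile(fileext = ".asc")
writeLines(record, toy_file)

typed <- read_fwf(toy_file,
                  fwf_positions(start = codebook$X6,
                                end = codebook$X7,
                                col_names = codebook$X5),
                  col_types = col_spec)

print(typed)
```

Supplying the types up front avoids relying on readr's type guessing, which only inspects a limited number of rows and may differ from the documented types on a file this large.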