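I'm new to all of this: programming, statistics, and R. I'm trying to load a large dataset into R. It is in ASC format. I've spent many hours trying everything from read.table to rgdal to read.asc, without success. The file is 1.5 GB, so I can't open it in a text editor. My mentor says it needs to be read line by line. The plan was to read the first 50 records to see whether it works, but it doesn't: all I get are a few empty columns. Is there an obvious problem here? I have checked that all the column names and character positions, the working directory, and the file name are correct.
Here is a link to the record layout, so you can see why it is done this way: http://www.hcup-us.ahrq.gov/db/nation/kid/tools/stats/FileSpecifications_KID_2012_Core.TXT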
input = file("KID2012Core.asc","r")
numRows = 50;
df = data.frame(row=seq(1,numRows),
HOSP_KID = NA,
RECNUM = NA,
AGE = NA,
AGE_NEONATE = NA,
AMONTH = NA,
AWEEKEND = NA,
DIED = NA,
DISCWT = NA,
DISPUNIFORM = NA,
DQTR = NA,
DRG = NA,
DRG24 = NA,
DRGVER = NA,
DRG_NoPOA = NA,
DX1 = NA,
DX2 = NA,
DX3 = NA,
DX4 = NA,
DX5 = NA,
DX6 = NA,
DX7 = NA,
DX8 = NA,
DX9 = NA,
DX10 = NA,
DX11 = NA,
DX12 = NA,
DX13 = NA,
DX14 = NA,
DX15 = NA,
DX16 = NA,
DX17 = NA,
DX18 = NA,
DX19 = NA,
DX20 = NA,
DX21 = NA,
DX22 = NA,
DX23 = NA,
DX24 = NA,
DX25 = NA,
DXCCS1 = NA,
DXCCS2 = NA,
DXCCS3 = NA,
DXCCS4 = NA,
... and so on, for all 142 columns
for(i in seq(1,numRows)) {
line = readLines(input,n=1)
df$HOSP_KID[i] = substr(input, 1, 5)
df$RECNUM[i] = substr(input, 6, 13)
df$AGE[i] = substr(input, 14, 16)
df$AGE_NEONATE[i] = substr(input, 17, 18)
df$AMONTH[i] = substr(input, 19, 20)
df$AWEEKEND[i] = substr(input, 21, 22 )
df$DIED[i] = substr(input, 23, 24)
df$DISCWT[i] = substr(input, 25, 35)
df$DISPUNIFORM[i] = substr(input, 36, 37)
df$DQTR[i] = substr(input, 38, 39)
df$DRG[i] = substr(input, 40, 42)
df$DRG24[i] = substr(input, 43, 45)
df$DRGVER[i] = substr(input, 46, 47)
df$DRG_NoPOA[i] = substr(input, 48, 50)
df$DX1[i] = substr(input, 51, 55)
df$DX2[i] = substr(input, 56, 60)
df$DX3[i] = substr(input, 61, 65)
df$DX4[i] = substr(input, 66, 70)
... and so on
}
Thanks in advance!
Answer 0 (score: 0)
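You're working with a fixed-width file, which takes some work to read. There is a base function called read.fwf, but it requires you to work out the width of each column and of the gaps between columns, which can get tedious. The readr package offers several ways to set the column specification in its alternative, read_fwf. In this case, the information at the top of the layout file works well with its fwf_positions helper function: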
library(readr)
df <- read_fwf('http://www.hcup-us.ahrq.gov/db/nation/kid/tools/stats/FileSpecifications_KID_2012_Core.TXT',
fwf_positions(start = c(1, 5, 10, 27, 31, 61, 65, 69, 71, 76),
end = c(3, 8, 25, 29, 59, 63, 67, 69, 74, NA)), # use NA here for ragged column
skip = 20)
df
## # A tibble: 142 × 10
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <chr> <int> <chr> <int> <int> <int> <chr>
## 1 KID 2012 Core 1 HOSP_KID 1 5 NA Num
## 2 KID 2012 Core 2 RECNUM 6 13 NA Num
## 3 KID 2012 Core 3 AGE 14 16 NA Num
## 4 KID 2012 Core 4 AGE_NEONATE 17 18 NA Num
## 5 KID 2012 Core 5 AMONTH 19 20 NA Num
## 6 KID 2012 Core 6 AWEEKEND 21 22 NA Num
## 7 KID 2012 Core 7 DIED 23 24 NA Num
## 8 KID 2012 Core 8 DISCWT 25 35 7 Num
## 9 KID 2012 Core 9 DISPUNIFORM 36 37 NA Num
## 10 KID 2012 Core 10 DQTR 38 39 NA Num
## # ... with 132 more rows, and 1 more variables: X10 <chr>
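This file is actually the codebook for the data file, but you can use the positions in X6 and X7 to read the data file in the same way; perhaps something like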
df2 <- read_fwf('a_big_fwf.txt',
fwf_positions(start = df$X6, end = df$X7, col_names = df$X5),
n_max = 50) # read first 50 lines
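If you would rather stay with the readLines approach from the question, note that the loop there calls substr() on input (the connection) instead of on line (the text that was just read), which would explain the empty columns. Below is a minimal base-R sketch of the same idea without the row loop. The layout data frame, the two sample records, and the temporary file are made up for illustration; only the four field names and their positions come from the record layout in the question.

```r
# Hypothetical layout: the first four fields of the question's record layout
layout <- data.frame(
  name  = c("HOSP_KID", "RECNUM", "AGE", "AGE_NEONATE"),
  start = c(1, 6, 14, 17),
  end   = c(5, 13, 16, 18),
  stringsAsFactors = FALSE
)

# Made-up records standing in for KID2012Core.asc
path <- tempfile(fileext = ".asc")
writeLines(c("100010000000101201", "100020000000200502"), path)

con <- file(path, "r")
lines <- readLines(con, n = 50)   # reads at most 50 lines
close(con)

# substring() is vectorised over `lines`, so each field is cut out of
# every record in one call; note it is applied to the text, not to `con`
cols <- Map(function(s, e) trimws(substring(lines, s, e)),
            layout$start, layout$end)
names(cols) <- layout$name
first_rows <- as.data.frame(cols, stringsAsFactors = FALSE)
first_rows
```

Everything comes back as character here, so numeric fields would still need as.numeric() afterwards; read_fwf handles that conversion for you.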
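If you plan to read the whole file, you can also specify col_types; the information is in X9, but you will need to convert it into a format readr understands. See vignette('column-types', package = 'readr') for details.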
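As a rough sketch of that conversion: readr accepts a compact string with one letter per column, where "d" means double and "c" means character. The type_labels vector below is a stand-in for df$X9, and the "Char" label is an assumption, since only "Num" is visible in the output above; check the actual values in X9 before relying on this.

```r
# Stand-in for df$X9 from the codebook; "Char" is an assumed label
type_labels <- c("Num", "Num", "Char", "Num")

# Build readr's compact col_types string: "d" = double, "c" = character
compact_types <- paste(ifelse(type_labels == "Num", "d", "c"), collapse = "")
compact_types
# "ddcd"
```

The resulting string could then be passed as col_types = compact_types in the read_fwf call. Columns that hold codes with meaningful leading zeros are usually safer read as character, whatever the codebook calls them.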