Question

Consider the following lines from a Stata .dct file, which tell Stata how to read this fixed-width ASCII file (it can be unzipped with any ZIP software on any platform):

start             type                            varname width  description
_column(24)       long                               rfv1   %5f  Patient's Reason for Visit #1            
_column(29)       long                               rfv2   %5f  Patient's Reason for Visit #2             
_column(34)       long                               rfv3   %5f  Patient's Reason for Visit #3             
_column(24)       long                             rfv13d   %4f  Patient's Reason for Visit #1 - broad     
_column(29)       long                             rfv23d   %4f  Patient's Reason for Visit #2 - broad     
_column(34)       long                             rfv33d   %4f  Patient's Reason for Visit #3 - broad

Basically, the 24th through 38th characters in each line of this ASCII file look like this:

AAAAaBBBBbCCCCc

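Here, if the broad code for the first reason is AAAA, the narrower code for the same reason is AAAAa, and so on.

In other words, because the codes themselves are hierarchical, the same characters in each line are read twice to create two different variables.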

read.fwf, by contrast, takes only a widths argument, which rules out this kind of double reading.

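Is there a standard way to handle this, short of reinventing the wheel by using scan on the whole file and parsing it by hand?

The background is that I am writing a function to parse these .DCT files, in the style of SAScii, and my job would be much simpler if I could specify a (start, width) pair for each variable rather than just widths.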

Answer 1

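I started working on a .DCT parser but lost steam. My intended use case was simply to parse the file and create a csvkit schema file, so that I could use csvkit to convert the file from fixed width to csv. For that purpose the package succeeds, but it is very fragile and has had only minimal testing.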

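A few issues to watch out for: (1) not all DCT files have the same columns; (2) some DCT files contain instructions for implicit decimal places, and I never worked out a way to handle those types of files.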
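One way the implicit-decimal case might be approached is sketched below. It assumes that a format such as %5.2f means "read 5 characters and, if the field contains no decimal point, treat the last 2 digits as decimals"; that reading of the format should be checked against the Stata documentation. Both helper names are hypothetical and are not part of the package described here.

```r
## Hypothetical helper: split a Stata-style numeric format such as "%5.2f"
## into its field width and its number of implied decimal places.
parse_infmt <- function(fmt) {
  m <- regmatches(fmt, regexec("^%([0-9]+)(\\.([0-9]+))?f$", fmt))[[1]]
  if (length(m) == 0) stop("unrecognised format: ", fmt)
  list(width    = as.integer(m[2]),
       decimals = if (nzchar(m[4])) as.integer(m[4]) else 0L)
}

## Hypothetical helper: scale raw fields that carry no explicit decimal
## point (assumption: fields that already contain "." are left unchanged).
apply_implied_decimals <- function(x, decimals) {
  has_point <- grepl(".", x, fixed = TRUE)
  value <- as.numeric(x)
  ifelse(has_point, value, value / 10^decimals)
}

fmt <- parse_infmt("%5.2f")
scaled <- apply_implied_decimals(c("12345", "12.5"), fmt$decimals)
```

The width and decimals are kept separate so that the width can still feed the column lookup table while the decimals are applied after the raw text has been read.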

You can find the initial work on the package here.

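The main functions are:

dct.parser - What you would expect. It has a "preview" argument that reads in the first few lines so you can check whether the DCT file has all the columns you expect.
csvkit.schema - Uses the information extracted by dct.parser to create a csv file with the columns that in2csv from csvkit requires.
csvkit.fwf2csv - Basically a system call to csvkit. This step can also be run outside of R.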
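As a rough illustration of the schema step, the sketch below writes a csv with the three columns that csvkit's fixed-width schema uses (column, start, length). The lookup table is made up, using the ColName, StartPos and ColWidth names that appear in the dct.parser output later in this answer. It is not the package's own code, and whether csvkit reads start as 0-based or 1-based for a given schema is an assumption to verify against the csvkit documentation; the sketch copies StartPos unchanged.

```r
## Made-up lookup table in the shape of the dct.parser output
spec <- data.frame(
  ColName  = c("rfv1", "rfv2", "rfv3", "rfv13d", "rfv23d", "rfv33d"),
  StartPos = c(24L, 29L, 34L, 24L, 29L, 34L),
  ColWidth = c(5L, 5L, 5L, 4L, 4L, 4L),
  stringsAsFactors = FALSE
)

## csvkit schema columns: column, start, length
## (assumption: check the expected index base of "start" before real use)
schema <- data.frame(column = spec$ColName,
                     start  = spec$StartPos,
                     length = spec$ColWidth)

## Written to a temporary directory so the sketch leaves no files behind
schema_path <- file.path(tempdir(), "demo-schema.csv")
write.csv(schema, schema_path, row.names = FALSE, quote = FALSE)

## The conversion itself is done by csvkit, for example from a shell:
##   in2csv -f fixed -s demo-schema.csv ED02 > ED02.csv
```

Because each schema row is an independent (start, length) pair, overlapping columns such as rfv1 and rfv13d can be listed side by side.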

For your specific example, I read the file in successfully with:

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
dct.parser("ed02.dct", preview=TRUE) ## It seems that everything is there
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## The next line automatically creates a csv schema file in your 
##   working directory named, by default, "your-dct-filename.csv"
csvkit.schema(temp) 
csvkit.fwf2csv(datafile = "ED02", schema="ed02.dct.csv", output="ED02.csv")

## I haven't set up any mechanism to check on progress...
## Just check the directory and see when the file stops growing :)
ED02 <- read.csv("ED02.csv")

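Another alternative I had in mind (but never implemented) was to use paste to construct substr commands that sqldf could use to read in data whose columns overlap. See this blog post for an example to get started.

Update: `sqldf` example

As mentioned above, sqldf can make good use of the output of dct.parser and read in your data using substr. Here is an example of how you would do that: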

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## Construct your "substr" command
GetMe <- paste("select", 
               paste("substr(V1, ", temp$StartPos, ", ",
                     temp$ColWidth, ") `", temp$ColName, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

fixed <- file("ED02")
ED02 <- sqldf(GetMe, file.format = list(sep = "_"))
dim(ED02)
# [1] 37337   260

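As can be seen, the sqldf line needed a slight modification. In particular, since sqldf uses read.csv.sql, it would otherwise treat any comma in the data as a delimiter and split the line. Setting sep in file.format to a character you do not expect in the data (here "_") keeps each line in the single column V1.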
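The same substr idea can also be sketched in base R, without going through SQLite. In the sketch below, read_fwf_pairs is a hypothetical helper and the demo line is made up; StartPos, ColWidth and ColName mirror the dct.parser output used above. Unlike the sqldf route, this holds every line in memory, so it is only a starting point for large files.

```r
## Hypothetical helper: read fixed-width columns from (start, width) pairs.
## Columns may overlap because each one is sliced independently.
read_fwf_pairs <- function(lines, start, width, col_names) {
  stopifnot(length(start) == length(width),
            length(start) == length(col_names))
  cols <- Map(function(s, w) trimws(substring(lines, s, s + w - 1L)),
              start, width)
  names(cols) <- col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

## Made-up lookup table in the shape of the dct.parser output
spec <- data.frame(
  ColName  = c("rfv1", "rfv2", "rfv3", "rfv13d", "rfv23d", "rfv33d"),
  StartPos = c(24L, 29L, 34L, 24L, 29L, 34L),
  ColWidth = c(5L, 5L, 5L, 4L, 4L, 4L),
  stringsAsFactors = FALSE
)

## Made-up record: 23 filler characters, then three 5-digit codes.
## For the real file this would be readLines("ED02").
demo_lines <- paste0(strrep(" ", 23), "100502010130202")

demo <- read_fwf_pairs(demo_lines, spec$StartPos, spec$ColWidth, spec$ColName)
```

All columns come back as character; type.convert can be applied afterwards where numeric codes are wanted.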

Answer 2

This was only just tagged Stata (thanks to @Metrics), so many Stata enthusiasts will not have noticed it.

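From a pure Stata point of view, it seems simplest to read each 5-digit code as a long variable and then extract the first 4 digits, e.g. by

. gen rfv13d = floor(rfv1/10)

or to read the codes in as strings and then

. gen rfv13d = substr(rfv1, 1, 4)

That way, you never need to read the same data twice.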

That said, this seems tangential to a question in which the dictionary files are given and you don't want to edit several of them by hand.
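For readers following the R side of the thread, the same derivation can be sketched in R. The codes below are made up and stand in for rfv1; the sketch assumes every code is a non-missing 5-digit value.

```r
## Made-up 5-digit codes standing in for rfv1
rfv1 <- c(10050L, 20101L, 30202L)

## Numeric route: integer division by 10 drops the last digit
rfv13d_num <- rfv1 %/% 10L

## String route: pad to 5 characters, then keep the first four
rfv13d_chr <- substr(sprintf("%05d", rfv1), 1, 4)
```

The string route keeps any leading zeros in the broad code, which the numeric route would drop.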

Reading the same columns more than once in a fixed-width file
