Question

I am trying to read a Stata dataset into R with the foreign package, but when I try to read the file with the following:

library(foreign)
data <- read.dta("data.dta")

I get the following error:

Error in read.dta("data.dta") : a binary read error occurred

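The file works fine in Stata. This site suggests saving the file in Stata without labels and then reading it into R. With this workaround I can load the file into R, but then I lose the labels. Why am I getting this error, and how can I read the file into R with its labels? Another person found that this error occurs when the file contains variables that have no values. My data does have at least one or two such variables, but I have no easy way to identify them in Stata. It is a very large file with thousands of variables.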

Answer 1

You should call library(foreign) before reading Stata data.

library(foreign)
data <- read.dta("data.dta")

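Update: as mentioned here,

"The error message means that the file was found and that it starts with the right sequence of bytes for a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it expected to read."

However, without any further information we can only guess.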
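
To collect a little more information, the error can be captured instead of aborting the script. The following is only a minimal sketch: the helper name try_read_dta is hypothetical, and it does nothing beyond reporting the message that read.dta already produces.

```r
library(foreign)

# Hypothetical helper: returns the data frame, or NULL if read.dta() fails.
try_read_dta <- function(path) {
  tryCatch(
    read.dta(path),
    error = function(e) {
      # Report the original error text so it can be logged or searched for.
      message("read.dta failed for ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

data <- try_read_dta("data.dta")
if (is.null(data)) message("The file could not be read; try another import route.")
```

Returning NULL lets the calling code decide whether to fall back to one of the workarounds discussed in the other answers.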

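Update in response to the OP's question and answer:

I tested whether that is the case using Stata's auto dataset, and it is not. So there must be some other reason:

*Claims 1 and 2: R's read.dta generates an error if a variable contains missing values or if the dataset has labels*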

* this dataset has labels
sysuse auto
* set mpg to missing in every observation
replace mpg=.
br in 1/10
make             price   mpg   rep78   headroom   trunk   weight   length   turn   displacement   gear_ratio   foreign
AMC Concord      4099          3       2.5        11      2930     186      40     121            3.58         Domestic
AMC Pacer        4749          3       3.0        11      3350     173      40     258            2.53         Domestic
AMC Spirit       3799                  3.0        12      2640     168      35     121            3.08         Domestic
Buick Century    4816          3       4.5        16      3250     196      40     196            2.93         Domestic
Buick Electra    7827          4       4.0        20      4080     222      43     350            2.41         Domestic
Buick LeSabre    5788          3       4.0        21      3670     218      43     231            2.73         Domestic
Buick Opel       4453                  3.0        10      2230     170      34     304            2.87         Domestic
Buick Regal      5189          3       2.0        16      3280     200      42     196            2.93         Domestic
Buick Riviera    10372         3       3.5        17      3880     207      43     231            2.93         Domestic
Buick Skylark    4082          3       3.5        13      3400     200      42     231            3.08         Domestic

save "~\myauto"
describe

Contains data from ~\myauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          25 Aug 2013 11:32
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  foreign


library(foreign)
myauto <- read.dta("myauto.dta")  # reads without error
str(myauto)
'data.frame':   74 obs. of  12 variables:
 $ make        : chr  "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
 $ price       : int  4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
 $ mpg         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ rep78       : int  3 3 NA 3 4 3 NA 3 3 3 ...
 $ headroom    : num  2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
 $ trunk       : int  11 11 12 16 20 21 10 16 17 13 ...
 $ weight      : int  2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
 $ length      : int  186 173 168 196 222 218 170 200 207 200 ...
 $ turn        : int  40 40 35 40 43 43 34 42 43 42 ...
 $ displacement: int  121 258 121 196 350 231 304 196 231 231 ...
 $ gear_ratio  : num  3.58 2.53 3.08 2.93 2.41 ...
 $ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "1978 Automobile Data"
 - attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
 - attr(*, "formats")= chr  "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int  18 252 252 252 254 252 252 252 252 252 ...
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
 - attr(*, "expansion.fields")=List of 2
  ..$ : chr  "_dta" "note1" "from Consumer Reports with permission"
  ..$ : chr  "_dta" "note0" "1"
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ origin: Named int  0 1
  .. ..- attr(*, "names")= chr  "Domestic" "Foreign"
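
The str() output above shows that read.dta keeps the Stata variable labels in the "var.labels" attribute. A minimal sketch of pairing them with the column names follows; the two-column data frame is a toy stand-in built by hand so the example runs without a .dta file.

```r
# Toy stand-in for the data frame returned by read.dta(); the attribute
# is set by hand here, whereas read.dta() would set it from the file.
myauto <- data.frame(make = c("AMC Concord", "AMC Pacer"),
                     price = c(4099L, 4749L),
                     stringsAsFactors = FALSE)
attr(myauto, "var.labels") <- c("Make and Model", "Price")

# Pair each column name with its Stata variable label.
var_labels <- setNames(attr(myauto, "var.labels"), names(myauto))
var_labels["price"]
```

Note that attributes like this one can be dropped when the data frame is subset, so it may be safer to extract the labels right after reading.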

Answer 2

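Here is a list of possible solutions. My guess is that the first item has a 75% chance of solving your problem.

1. In Stata, resave a new copy of the dta file with saveold, then try again.
2. If that fails, provide an example showing which kind of values cause the read.dta function to fail.
3. If missing values are to blame, run the loop from the other answer.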

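Getting past this point requires a fuller description of the dataset. This problem appears to have been resolved, and I have run into plenty of trouble using foreign with a large number of Stata files.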

You can also try the Stata.file function in the memisc package to see whether it fails too.

Answer 3

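I don't know why this happens, and I would be interested if someone could explain it, but read.dta really cannot handle variables that are entirely NA. The solution is to drop such variables in Stata with the following code: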

foreach varname of varlist * {
 * count non-missing values; unlike summarize, this also works for string variables
 quietly count if !missing(`varname')
 if r(N)==0 {
  drop `varname'
  disp "dropped `varname': all values missing"
 }
}
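
Since the question notes that the file does load into R when it is saved without labels, the all-missing variables could also be located from the R side. This is a minimal sketch on a toy data frame; the column names are illustrative only.

```r
# Toy data frame standing in for the file loaded without labels.
df <- data.frame(price = c(4099, 4749, 3799),
                 mpg   = c(NA, NA, NA),
                 rep78 = c(3, 3, NA))

# TRUE for each column in which every value is NA.
all_na <- vapply(df, function(x) all(is.na(x)), logical(1))
names(df)[all_na]

# Drop those columns in R, or use the names in a Stata `drop` command.
df_clean <- df[, !all_na, drop = FALSE]
```

With thousands of variables, the resulting name vector can be pasted into a single drop command in Stata before saving the labelled file again.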

Answer 4

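It took a lot of time, but I solved the same problem by exporting the .dta data to .csv. The problem was related to the labels of factor variables, specifically because the labels were in Spanish and the ASCII encoding was a mess. I hope this is useful for anyone who has the same problem and uses Stata.

In Stata: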

export delimited using "/Users/data.csv", nolabel replace

In R:

df <- read.csv("/Users/data.csv")
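
Because the nolabel option exports the numeric codes only, the value labels have to be re-attached in R. The sketch below assumes the codes and labels are known (here 0 and 1 for Domestic and Foreign, as in the label table of the auto example); the CSV text is a toy stand-in for the exported file.

```r
# Toy CSV text standing in for the file written by `export delimited, nolabel`.
csv_text <- "make,foreign\nAMC Concord,0\nAMC Pacer,0\nAudi 5000,1"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Re-attach the value labels by hand (codes and labels assumed known).
df$foreign <- factor(df$foreign, levels = c(0, 1),
                     labels = c("Domestic", "Foreign"))
table(df$foreign)
```

For labels with non-ASCII characters, read.csv also accepts a fileEncoding argument; which encoding applies depends on how Stata wrote the file.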

Error reading Stata data in R
