Question

I have a .txt file with the following format:

--------------------------------------------------------------------------------------------------------------
m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

--------------------------------------------------------------------------------------------------------------
m5a3                                                    A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (int)
                 label:  NUMERIC, but 44 nonmissing values are not labeled

                 range:  [-9,120]                     units:  1
         unique values:  47                       missing .:  0/4898

              examples:  -9    -9 Not in wave
                         -6    -6 Skip
                         -6    -6 Skip
                         -6    -6 Skip

--------------------------------------------------------------------------------------------------------------

What matters to me is the variable code (e.g. m5a2), the description (A2. Confirm how much time child lives with respondent), and finally the response values:

tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends

I need to read these three items into a list for further processing.

I tried the following, which retrieves the code and the description.

fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
L = list()
for (i in 1:length(linn)){
  if((linn[i]=="--------------------------------------------------------------------------------------------------------------") & (linn[i+1]!=""))
  {
    L[i] = linn[i+1]
  }

  else
  {
    # read until hit the next dashed line
  }
}
close(conn)

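A few things I'm confused about:

1. I don't know how to make it keep reading, line by line, until it hits the next dashed line.
2. If I want to be able to search the data visually and retrieve it easily, is storing what I read in a list the right approach?

Thanks.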

Answer 1

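This is going to be somewhat troublesome because the format of each item is irregular. Here is a run on the text of the first codebook item: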

txt <- "m5a2                                                     A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  BM_101F

                 range:  [-9,7]                       units:  1
         unique values:  8                        missing .:  0/4898

            tabulation:  Freq.   Numeric  Label
                          1383        -9  -9 Not in wave
                             4        -2  -2 Don't know
                             2        -1  -1 Refuse
                          3272         1  1 all or most of the time
                            29         2  2 about half of the time
                            76         3  3 some of the time
                            80         4  4 none of the time
                            52         7  7 only on weekends
"
Lines <- readLines( textConnection(txt))
 # isolate lines with letter in first column
 Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:

scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
     sep=",", what="")
#----
Read 2 items
[1] "m5a2"                                                 
[2] "A2. Confirm how much time child lives with respondent"

The "tabulation" line can be used to create the column labels.

colnames <- scan(text=sub(".*tabulation[:]", "",
                     Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items

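The comma-substitution strategy needs to be a bit more involved for the lines that follow. First isolate the rows whose first non-space character is a digit: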

dataRows <- Lines[grep("^[ ]*\\d", Lines)]

Then replace each digit followed by two or more spaces with that digit plus a comma, and read the result with read.csv:

 myDat <- read.csv(text=  
                      gsub("(\\d)[ ]{2,}", "\\1,", dataRows ), 
                   header=FALSE ,col.names=colnames)

#------------
 myDat
    V1 V2                        V3
1 1383 -9            -9 Not in wave
2    4 -2             -2 Don't know
3    2 -1                 -1 Refuse
4 3272  1 1 all or most of the time
5   29  2  2 about half of the time
6   76  3        3 some of the time
7   80  4        4 none of the time
8   52  7        7 only on weekends

If the Lines object were the entire file, then a counter generated with cumsum(grepl("^-------", Lines)) could be used to loop over the multiple items, e.g.:

 Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
  input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
  input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
  input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
  input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
  input string 7353 is invalid in this locale
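
As a rough sketch of that counter idea: every item is framed by two dashed lines, so each pair of dashed lines can be mapped to one record number and the remaining lines split on it. The short Lines vector and the names is_dash, counter, record_id and blocks are made up for illustration; the real file has many more lines per item.

```r
# Toy stand-in for the codebook (assumed layout: dash, header, dash, body, ...).
dash <- strrep("-", 110)
Lines <- c(
  dash,
  "m5a2          A2. Confirm how much time child lives with respondent",
  dash,
  "",
  "                  type:  numeric (byte)",
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  dash,
  "m5a3          A3. Number of months ago child stopped living with you",
  dash,
  "",
  "                  type:  numeric (int)",
  "              examples:  -9    -9 Not in wave",
  dash
)

is_dash   <- grepl("^-------", Lines)
counter   <- cumsum(is_dash)        # increases by one at every dashed line
record_id <- (counter + 1) %/% 2    # dashed lines 1-2 -> item 1, 3-4 -> item 2, ...

# One character vector per item: the header line first, then the body lines.
blocks <- split(Lines[!is_dash], record_id[!is_dash])

# Name each block by its variable code (everything before the first space).
names(blocks) <- vapply(blocks, function(b) sub("[ ].*$", "", b[1]), character(1))
```

With this, blocks[["m5a2"]] returns all lines belonging to that item, which is one way to "read until the next dashed line" without an explicit loop.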

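My "hand-held scan()-er" suggests to me that there are only two types of codebook records: "tabulation" (presumably for items with fewer than 10 or so distinct values) and "examples" (for items with more). They have different structures (as seen in the snippet in the question), so it may be that only two types of parsing logic need to be built and deployed. So I think the tools described above should be sufficient.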
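
A minimal sketch of what those two kinds of parsing logic might look like, assuming the row layouts seen in the excerpt (frequency, value, label for "tabulation"; value, label for "examples"). The helper parse_item, the sample blocks and the column names are hypothetical, and the function does not handle items that have no value rows at all.

```r
# Hypothetical helper: turn one item's lines into code, description and values.
parse_item <- function(block) {
  # Header line: variable code, a run of spaces, then the description.
  header      <- block[grep("^[a-zA-Z]", block)][1]
  code        <- sub("[ ].*$", "", header)
  description <- sub("^[^ ]+[ ]+", "", header)

  if (any(grepl("tabulation:", block, fixed = TRUE))) {
    # Rows look like: <freq> <numeric value> <label>
    rows <- block[grep("^[ ]*[0-9]", block)]
    m    <- regmatches(rows, regexec("^[ ]*([0-9]+)[ ]+(-?[0-9]+)[ ]+(.*)$", rows))
    mat  <- do.call(rbind, m)
    values <- data.frame(Freq    = as.integer(mat[, 2]),
                         Numeric = as.integer(mat[, 3]),
                         Label   = mat[, 4],
                         stringsAsFactors = FALSE)
  } else {
    # "examples" rows look like: <numeric value> <label>, with no frequency.
    body <- sub("^[ ]*examples:", "", block)
    rows <- body[grep("^[ ]*-?[0-9]", body)]
    m    <- regmatches(rows, regexec("^[ ]*(-?[0-9]+)[ ]*(.*)$", rows))
    mat  <- do.call(rbind, m)
    values <- data.frame(Numeric = as.integer(mat[, 2]),
                         Label   = mat[, 3],
                         stringsAsFactors = FALSE)
  }
  list(code = code, description = description, values = values)
}

# Made-up, shortened blocks (dashed lines already removed).
tab_block <- c(
  "m5a2          A2. Confirm how much time child lives with respondent",
  "",
  "            tabulation:  Freq.   Numeric  Label",
  "                          1383        -9  -9 Not in wave",
  "                             4        -2  -2 Don't know"
)
ex_block <- c(
  "m5a3          A3. Number of months ago child stopped living with you",
  "",
  "              examples:  -9    -9 Not in wave",
  "                         -6    -6 Skip"
)

items <- lapply(list(tab_block, ex_block), parse_item)
names(items) <- vapply(items, function(it) it$code, character(1))
```

Storing the result as a named list keyed by variable code keeps lookups simple: items$m5a2$description and items$m5a2$values retrieve the pieces for one item.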

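The warnings all relate to the character "\x92" being used as an apostrophe. Regular expressions and R share the escape character "\", so you need to escape the escape. They can be corrected with: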

Lines <- gsub("\\\x92", "'", Lines )
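
If the doubled escape feels fragile, a literal, byte-wise replacement may be an alternative. This is only a sketch: the sample string x is made up, and it assumes the offending byte really is 0x92 as the warnings suggest.

```r
# Made-up sample containing the raw byte 0x92 where an apostrophe should be.
x <- "-2 Don\x92t know"

# fixed = TRUE treats the pattern as a literal string rather than a regex;
# useBytes = TRUE matches byte by byte instead of character by character.
fixed_x <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
```

Applied to the whole file, the same call would be gsub("\x92", "'", Lines, fixed = TRUE, useBytes = TRUE).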

Answer 2

How about this?

df <- read.table("file.txt", 
             header = FALSE)
df
