I have a .txt file with the following format:
--------------------------------------------------------------------------------------------------------------
m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
--------------------------------------------------------------------------------------------------------------
m5a3 A3. Number of months ago child stopped living with you
--------------------------------------------------------------------------------------------------------------
type: numeric (int)
label: NUMERIC, but 44 nonmissing values are not labeled
range: [-9,120] units: 1
unique values: 47 missing .: 0/4898
examples: -9 -9 Not in wave
-6 -6 Skip
-6 -6 Skip
-6 -6 Skip
--------------------------------------------------------------------------------------------------------------
What matters to me is the code name, e.g. m5a2, the description, e.g. A2. Confirm how much time child lives with respondent, and finally the response values:
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
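I need to read these three items into a list for further processing.
I tried the following, which retrieves the code name and the description.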
fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
L = list()
for (i in 1:length(linn)){
if((linn[i]=="--------------------------------------------------------------------------------------------------------------") & (linn[i+1]!=""))
{
L[i] = linn[i+1]
}
else
{
# read until hit the next dashed line
}
}
close(conn)
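A few things I'm confused about: 1. I don't know how to make it keep reading lines until it reaches the next dashed line. 2. If I want to be able to search the data visually and retrieve it easily, is storing the parsed data in a list the right approach?
Thanks.
Answer 0 (score: 0)
This is going to be somewhat awkward because the format of each item is irregular. Here is a run on the codebook text of the first item: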
txt <- "m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
"
Lines <- readLines( textConnection(txt))
# isolate lines with letter in first column
Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:
scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
sep=",", what="")
#----
Read 2 items
[1] "m5a2"
[2] "A2. Confirm how much time child lives with respondent"
The "tabulation" line can be used to create the column labels.
colnames <- scan(text=sub(".*tabulation[:]", "",
Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items
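The comma-substitution strategy is a bit more involved for the subsequent lines. First isolate the rows where a digit is the first non-space character: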
dataRows <- Lines[grep("^[ ]*\\d", Lines)]
Then replace each digit followed by two or more spaces with that digit plus a comma, and read the result with read.csv:
myDat <- read.csv(text=
gsub("(\\d)[ ]{2,}", "\\1,", dataRows ),
header=FALSE ,col.names=colnames)
#------------
myDat
V1 V2 V3
1 1383 -9 -9 Not in wave
2 4 -2 -2 Don't know
3 2 -1 -1 Refuse
4 3272 1 1 all or most of the time
5 29 2 2 about half of the time
6 76 3 3 some of the time
7 80 4 4 none of the time
8 52 7 7 only on weekends
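One limitation of the comma-substitution route is that a label containing a comma of its own would be split into extra fields by read.csv. As an alternative sketch (not the approach used above), the three fields can be captured directly with a regular expression via utils::strcapture. The sample rows below are typed in by hand with the wide spacing assumed for the real file, and the third label has a comma added purely for illustration.

```r
# Hand-typed sample rows; the comma in the third label is hypothetical.
dataRows <- c("  1383        -9  -9 Not in wave",
              "     4        -2  -2 Don't know",
              "  3272         1  1 all, or most of the time")

# Capture frequency, numeric code and label in one pass. The prototype
# data frame supplies the column names and the target types.
vals <- strcapture(
  "^\\s*(\\d+)\\s+(-?\\d+)\\s+(.*)$",
  dataRows,
  proto = data.frame(Freq = integer(), Numeric = integer(),
                     Label = character(), stringsAsFactors = FALSE)
)
vals
```

Because nothing is converted to CSV along the way, commas inside a label are left alone.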
If the Lines object held the entire file, this could be looped over multiple items using a counter generated with cumsum( grepl("^-------", Lines) ). For instance:
Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
input string 7353 is invalid in this locale
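My "hand-held scan()-er" suggests to me that there are only two types of codebook record: "tabulation" (presumably for items with fewer than 10 or so distinct values) and "examples" (for items with more). They have different structures (as seen in your excerpt above), so perhaps only two types of parsing logic need to be built and deployed. So I think the tools described above should be adequate for the task.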
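A minimal sketch of how the cumsum() counter could drive such a loop follows. It assumes every variable header sits between two dashed lines, so that header lines get an odd counter value and the body that follows gets the next even one. parse_codebook and sample_lines are hypothetical names introduced here, and only "tabulation" rows are parsed; "examples" records come back with an empty values table.

```r
# Hypothetical helper: turn codebook lines into a named list, one entry
# per variable, holding the description and the tabulated values.
parse_codebook <- function(Lines) {
  is_dash <- grepl("^-------", Lines)
  block   <- cumsum(is_dash)            # increases by one at each dashed line
  keep    <- !is_dash & nzchar(trimws(Lines))
  out <- list()
  # Odd counter values are header lines (assumption about the layout)
  for (h in which(keep & block %% 2 == 1)) {
    header <- trimws(Lines[h])
    name   <- sub("^(\\S+)\\s+.*$", "\\1", header)
    desc   <- sub("^\\S+\\s+", "", header)
    body   <- Lines[keep & block == block[h] + 1]
    # Rows of the form: frequency, numeric code, label
    m <- regmatches(body, regexec("^\\s*(\\d+)\\s+(-?\\d+)\\s+(.*)$", body))
    m <- m[lengths(m) == 4]
    values <- data.frame(
      Freq    = as.integer(vapply(m, `[`, "", 2)),
      Numeric = as.integer(vapply(m, `[`, "", 3)),
      Label   = trimws(vapply(m, `[`, "", 4)),
      stringsAsFactors = FALSE
    )
    out[[name]] <- list(description = desc, values = values)
  }
  out
}

# Hand-typed excerpt standing in for readLines() on the real file
sample_lines <- c(
  "----------------------------------------",
  "m5a2      A2. Confirm how much time child lives with respondent",
  "----------------------------------------",
  "        type:  numeric (byte)",
  "  tabulation:  Freq.   Numeric  Label",
  "               1383        -9  -9 Not in wave",
  "                  4        -2  -2 Don't know",
  "               3272         1  1 all or most of the time",
  "----------------------------------------",
  "m5a3      A3. Number of months ago child stopped living with you",
  "----------------------------------------",
  "        type:  numeric (int)",
  "    examples:  -9    -9 Not in wave",
  "               -6    -6 Skip",
  "----------------------------------------"
)

cb <- parse_codebook(sample_lines)
cb[["m5a2"]]$values                        # look up one variable by code name
descs <- vapply(cb, function(v) v$description, "")
names(cb)[grepl("months", descs)]          # search the descriptions
```

A named list keyed by code name keeps lookups simple, and the descriptions can be searched with grepl() as in the last two lines.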
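The warnings all relate to the character "\x92" being used as an apostrophe. Regex and R share an escape character, "\", so you need to escape the escapes. They can be corrected with: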
Lines <- gsub("\\\x92", "'", Lines )
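If the regex-based call still complains in a UTF-8 locale, a byte-wise fixed-string replacement is another option. This is a sketch that assumes the file is Windows-1252 encoded, so that byte 0x92 really is a curly apostrophe; the sample string is hypothetical.

```r
# Hypothetical sample line containing the raw byte 0x92
x <- "-2 Don\x92t know"

# fixed = TRUE treats the pattern as a literal string rather than a regex;
# useBytes = TRUE compares raw bytes, so the locale's encoding is not consulted.
cleaned <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
cleaned
```

The same call can be applied to the whole Lines vector before running grepl() on it.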
Answer 1 (score: -1)
How about this?
df <- read.table("file.txt",
header = FALSE)
df