I have a .txt file that I believe was exported from STATA, though I'm not sure. It is a list of tables that looks like this:
         Q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
     answer |         35       21.08       21.08
       text |          4        2.41       23.49
      words |         35       21.08        44.5
  something |         38       22.89       67.47
       blah |         54       32.53      100.00
------------+-----------------------------------
      Total |        166      100.00

               Q2 |      Freq.     Percent        Cum.
------------------+-----------------------------------
              foo |          1        0.60        0.60
         blahblah |         11        6.63        7.23
              etc |         26       15.66       22.89
        more text |         82       49.40       72.29
           answer |          7        4.22       76.51
  survey response |         39       23.49      100.00
------------------+-----------------------------------
            Total |        166      100.00

         Q3 |      Freq.     Percent        Cum.
------------+-----------------------------------
     option |          7        4.22        4.22
       text |         24       14.46       18.67
      blahb |         25       15.06       33.73
  more text |         82       49.40       83.13
        etc |         28       16.87      100.00
------------+-----------------------------------
      Total |        166      100.00
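There are about 200 questions, each with its own survey answers. Does anyone know a quick way to read each survey question into a separate data frame in R?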
Answer 0 (score: 3)
No need for scan():
txt <- "         Q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
     answer |         35       21.08       21.08
       text |          4        2.41       23.49
      words |         35       21.08        44.5
  something |         38       22.89       67.47
       blah |         54       32.53      100.00
------------+-----------------------------------
      Total |        166      100.00

               Q2 |      Freq.     Percent        Cum.
------------------+-----------------------------------
              foo |          1        0.60        0.60
         blahblah |         11        6.63        7.23
              etc |         26       15.66       22.89
        more text |         82       49.40       72.29
           answer |          7        4.22       76.51
  survey response |         39       23.49      100.00
------------------+-----------------------------------
            Total |        166      100.00

         Q3 |      Freq.     Percent        Cum.
------------+-----------------------------------
     option |          7        4.22        4.22
       text |         24       14.46       18.67
      blahb |         25       15.06       33.73
  more text |         82       49.40       83.13
        etc |         28       16.87      100.00
------------+-----------------------------------
      Total |        166      100.00"
library(purrr)
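You can just as easily read the text vector above from a file. The main goal here is to strip the cruft from the data and get it into a form we can work with, so we drop the dashed lines and the Total rows and turn the column separators (the "|" divider and runs of two or more spaces) into commas. This makes some big assumptions about the format of your data, so the format needs to be consistent throughout the file.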
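readLines(textConnection(txt)) %>%
  discard(~grepl("(----|Total)", .)) %>%            # drop dashed rules and Total rows
  gsub("[[:space:]]*\\|[[:space:]]*", ",", .) %>%   # the "|" divider becomes a comma
  gsub("[[:space:]][[:space:]]+", ",", .) %>%       # runs of 2+ spaces separate columns
  gsub("^,", "", .) -> lines                        # remove the comma left by leading padding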
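If the tables live in a file rather than in an in-memory string, the same cleanup can start from readLines() on the file path. The following is a minimal sketch in base R only, assuming the file has the layout shown in the question; the temporary file and the names `path`, `raw`, and `cleaned` are placeholders for illustration, not part of the original answer.

```r
# Stand-in for the real .txt file: write the first table from the question
# to a temporary file. In practice, point `path` at the actual file instead.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "         Q1 |      Freq.     Percent        Cum.",
  "------------+-----------------------------------",
  "     answer |         35       21.08       21.08",
  "       text |          4        2.41       23.49",
  "      words |         35       21.08        44.5",
  "  something |         38       22.89       67.47",
  "       blah |         54       32.53      100.00",
  "------------+-----------------------------------",
  "      Total |        166      100.00"
), path)

raw <- readLines(path)

# Same cleanup steps as the pipeline above, without purrr or the pipe
cleaned <- raw[!grepl("(----|Total)", raw)]                   # drop rules and Total rows
cleaned <- gsub("[[:space:]]*\\|[[:space:]]*", ",", cleaned)  # "|" divider to comma
cleaned <- gsub("[[:space:]][[:space:]]+", ",", cleaned)      # 2+ spaces to comma
cleaned <- gsub("^,", "", cleaned)                            # strip leading comma

print(cleaned)
```

Labels that contain a single space, such as "more text", survive because only runs of two or more spaces are treated as separators.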
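There is a blank line between the tables, which is another assumption the code relies on. We find the blank lines and extract the lines between them (including the start and the end of the text), then read each chunk into a data frame with read.csv.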
# Each table starts right after a blank line and ends right before the next one
starts <- c(1, which(lines=="")+1)
ends <- c(which(lines=="")-1, length(lines))
map2(starts, ends, function(start, end) {
  read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
})
This produces a list of data frames:
## [[1]]
##          Q1 Freq. Percent   Cum.
## 1    answer    35   21.08  21.08
## 2      text     4    2.41  23.49
## 3     words    35   21.08  44.50
## 4 something    38   22.89  67.47
## 5      blah    54   32.53 100.00
##
## [[2]]
##                Q2 Freq. Percent   Cum.
## 1             foo     1    0.60   0.60
## 2        blahblah    11    6.63   7.23
## 3             etc    26   15.66  22.89
## 4       more text    82   49.40  72.29
## 5          answer     7    4.22  76.51
## 6 survey response    39   23.49 100.00
##
## [[3]]
##          Q3 Freq. Percent   Cum.
## 1    option     7    4.22   4.22
## 2      text    24   14.46  18.67
## 3     blahb    25   15.06  33.73
## 4 more text    82   49.40  83.13
## 5       etc    28   16.87 100.00
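Since the question asks for one data frame per survey question, it may help to name the list elements after their question label so each table can be looked up directly. This is a sketch in base R, assuming cleaned lines in the comma-separated shape produced above; the sample is abbreviated to two short tables, and the name `tables` is a placeholder.

```r
# Abbreviated sample of cleaned lines: two tables separated by a blank line
lines <- c(
  "Q1,Freq.,Percent,Cum.",
  "answer,35,21.08,21.08",
  "text,4,2.41,23.49",
  "",
  "Q2,Freq.,Percent,Cum.",
  "foo,1,0.60,0.60",
  "blahblah,11,6.63,7.23"
)

# Same start/end logic as in the answer
starts <- c(1, which(lines == "") + 1)
ends   <- c(which(lines == "") - 1, length(lines))

# Base R equivalent of the map2() call
tables <- Map(function(start, end) {
  read.csv(text = lines[start:end], stringsAsFactors = FALSE)
}, starts, ends)

# Name each element after its first column header, i.e. the question label
names(tables) <- vapply(tables, function(df) names(df)[1], character(1))

print(tables$Q2)
```

With roughly 200 questions, `tables[["Q57"]]` style lookups are usually easier to work with than positional indexes.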
However, I think this would probably be more useful as one big data frame:
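library(dplyr)  # provides mutate() and select() used below
map2_df(starts, ends, function(start, end) {
  df <- read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
  # Normalise the headers: "Q1", "Freq.", ... become "q1", "freq", ...
  colnames(df) %>%
    tolower() %>%
    gsub("\\.", "", .) -> cols
  # The first header is the question label; keep it and rename that column
  question <- cols[1]
  cols[1] <- "text"
  setNames(df, cols) %>%
    mutate(question=question) %>%
    mutate(n=1:nrow(.)) %>%
    select(question, n, text, freq, percent, cum) %>%
    mutate(percent=percent/100, cum=cum/100)  # percentages to proportions
})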
##    question n            text freq percent    cum
## 1        q1 1          answer   35  0.2108 0.2108
## 2        q1 2            text    4  0.0241 0.2349
## 3        q1 3           words   35  0.2108 0.4450
## 4        q1 4       something   38  0.2289 0.6747
## 5        q1 5            blah   54  0.3253 1.0000
## 6        q2 1             foo    1  0.0060 0.0060
## 7        q2 2        blahblah   11  0.0663 0.0723
## 8        q2 3             etc   26  0.1566 0.2289
## 9        q2 4       more text   82  0.4940 0.7229
## 10       q2 5          answer    7  0.0422 0.7651
## 11       q2 6 survey response   39  0.2349 1.0000
## 12       q3 1          option    7  0.0422 0.0422
## 13       q3 2            text   24  0.1446 0.1867
## 14       q3 3           blahb   25  0.1506 0.3373
## 15       q3 4       more text   82  0.4940 0.8313
## 16       q3 5             etc   28  0.1687 1.0000
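Because the Total rows are discarded during cleanup, a quick sanity check on the combined data frame can catch parsing slips, for example a label that was split into two columns. The sketch below uses base R and rebuilds a small data frame from the q1 and q3 rows shown above; `combined`, `totals`, and `last_cum` are illustrative names, and the expected total of 166 comes from the Total rows in the question.

```r
# Rows for q1 and q3 copied from the combined output above
combined <- data.frame(
  question = c(rep("q1", 5), rep("q3", 5)),
  freq     = c(35, 4, 35, 38, 54, 7, 24, 25, 82, 28),
  cum      = c(0.2108, 0.2349, 0.4450, 0.6747, 1,
               0.0422, 0.1867, 0.3373, 0.8313, 1),
  stringsAsFactors = FALSE
)

# Frequencies per question should add up to the Total row (166 in this data)
totals <- tapply(combined$freq, combined$question, sum)

# The last cumulative share within each question should be 1
last_cum <- tapply(combined$cum, combined$question, function(x) x[length(x)])

print(totals)
print(last_cum)
```

Any question whose total or final cumulative share is off is worth inspecting in the raw file before trusting the parsed table.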