Question

I have a .txt file that I believe was exported from Stata, though I'm not sure. It is a list of tables that looks like this:

         Q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
     answer |         35       21.08       21.08
       text |          4        2.41       23.49
      words |         35       21.08       44.5
  something |         38       22.89       67.47
       blah |         54       32.53      100.00
------------+-----------------------------------
      Total |        166      100.00

               Q2 |      Freq.     Percent        Cum.
------------------+-----------------------------------
              foo |          1        0.60        0.60
         blahblah |         11        6.63        7.23
              etc |         26       15.66       22.89
        more text |         82       49.40       72.29
           answer |          7        4.22       76.51
  survey response |         39       23.49      100.00
------------------+-----------------------------------
            Total |        166      100.00

         Q3 |      Freq.     Percent        Cum.
------------+-----------------------------------
     option |          7        4.22        4.22
       text |         24       14.46       18.67
      blahb |         25       15.06       33.73
  more text |         82       49.40       83.13
        etc |         28       16.87      100.00
------------+-----------------------------------
      Total |        166      100.00

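There are about 200 questions, each with its own survey answers. Does anyone know a quick way to read each survey question into a separate data frame in R?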

Answer 1

No need for scan():

txt <- "         Q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
answer |         35       21.08       21.08
text |          4        2.41       23.49
words |         35       21.08       44.5
something |         38       22.89       67.47
blah |         54       32.53      100.00
------------+-----------------------------------
Total |        166      100.00

Q2 |      Freq.     Percent        Cum.
------------------+-----------------------------------
foo |          1        0.60        0.60
blahblah |         11        6.63        7.23
etc |         26       15.66       22.89
more text |         82       49.40       72.29
answer |          7        4.22       76.51
survey response |         39       23.49      100.00
------------------+-----------------------------------
Total |        166      100.00

Q3 |      Freq.     Percent        Cum.
------------+-----------------------------------
option |          7        4.22        4.22
text |         24       14.46       18.67
blahb |         25       15.06       33.73
more text |         82       49.40       83.13
etc |         28       16.87      100.00
------------+-----------------------------------
Total |        166      100.00"


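library(purrr)  # discard(), map2(), map2_df(), and the %>% pipe
library(dplyr)  # mutate() and select(), used for the combined data frame further down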

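You could just as easily read the text vector above from a file. The main goal here is to strip the cruft from the data and get it into a form we can work with, so we drop the dashed lines and the Total rows and turn the whitespace separators into commas. This makes strong assumptions about the format of your data, so the format needs to be consistent.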

readLines(textConnection(txt)) %>% 
  discard(~grepl("(----|Total)", .)) %>% 
  gsub("[[:space:]]*\\|[[:space:]]*", ",", .) %>% 
  gsub("[[:space:]][[:space:]]+", ",", .) %>% 
  gsub("^,", "", .) -> lines
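
If the tables live in a file rather than in a string, the same cleanup can start from readLines() on the path. The following is a minimal sketch in base R, assuming the file is plain text in the layout shown in the question. The helper name clean_stata_lines is hypothetical (not part of any package), and the temporary file exists only so the example can run on its own:

```r
# Hypothetical helper: the same cleanup steps as the pipeline above, in base R.
clean_stata_lines <- function(path) {
  x <- readLines(path)
  x <- x[!grepl("(----|Total)", x)]                 # drop dashed rules and Total rows
  x <- gsub("[[:space:]]*\\|[[:space:]]*", ",", x)  # "label |   35" -> "label,35"
  x <- gsub("[[:space:]][[:space:]]+", ",", x)      # runs of 2+ spaces -> comma
  gsub("^,", "", x)                                 # leading comma left by indentation
}

# Illustration only: write the Q1 table to a temporary file, then clean it.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "         Q1 |      Freq.     Percent        Cum.",
  "------------+-----------------------------------",
  "     answer |         35       21.08       21.08",
  "       text |          4        2.41       23.49",
  "      words |         35       21.08       44.5",
  "  something |         38       22.89       67.47",
  "       blah |         54       32.53      100.00",
  "------------+-----------------------------------",
  "      Total |        166      100.00"
), path)

lines_from_file <- clean_stata_lines(path)
print(lines_from_file)
```

Note that the "(----|Total)" pattern also drops any answer whose text happens to contain "Total", so it is worth checking your labels before relying on it.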

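There is a blank line between tables. That is another assumption the code makes. We find the blank lines and extract the lines between them (including the start and end of the text), then read each chunk into a data frame with read.csv.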

starts <- c(1, which(lines=="")+1)
ends <- c(which(lines=="")-1, length(lines))

map2(starts, ends, function(start, end) {
  read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
})

This produces a list of data frames:

## [[1]]
##          Q1 Freq. Percent   Cum.
## 1    answer    35   21.08  21.08
## 2      text     4    2.41  23.49
## 3     words    35   21.08  44.50
## 4 something    38   22.89  67.47
## 5      blah    54   32.53 100.00
## 
## [[2]]
##                Q2 Freq. Percent   Cum.
## 1             foo     1    0.60   0.60
## 2        blahblah    11    6.63   7.23
## 3             etc    26   15.66  22.89
## 4       more text    82   49.40  72.29
## 5          answer     7    4.22  76.51
## 6 survey response    39   23.49 100.00
## 
## [[3]]
##          Q3 Freq. Percent   Cum.
## 1    option     7    4.22   4.22
## 2      text    24   14.46  18.67
## 3     blahb    25   15.06  33.73
## 4 more text    82   49.40  83.13
## 5       etc    28   16.87 100.00
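
Since the question asks for one data frame per survey question, it may help to name each list element after its question so it can be looked up directly. This is a small sketch, not part of the original answer: it uses base R's Map() and read.csv(text = ...) in place of map2() and textConnection(), and the cleaned lines below are a shortened sample of the data above:

```r
# Cleaned lines in the form produced by the pipeline above (shortened sample).
lines <- c(
  "Q1,Freq.,Percent,Cum.",
  "answer,35,21.08,21.08",
  "text,4,2.41,23.49",
  "",
  "Q2,Freq.,Percent,Cum.",
  "foo,1,0.60,0.60",
  "blahblah,11,6.63,7.23"
)

# Same blank-line logic as above: each table runs from a start to an end index.
starts <- c(1, which(lines == "") + 1)
ends   <- c(which(lines == "") - 1, length(lines))

tables <- Map(function(start, end) {
  read.csv(text = lines[start:end], stringsAsFactors = FALSE)
}, starts, ends)

# Name each data frame after its first column, which holds the question label.
names(tables) <- vapply(tables, function(df) colnames(df)[1], character(1))

print(names(tables))
print(tables$Q2)
```

With roughly 200 questions, a named list like this is usually easier to manage than 200 separate variables in the workspace.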

However, I think this might be more useful as one big data frame:

map2_df(starts, ends, function(start, end) {

  df <- read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)

  colnames(df) %>% 
    tolower() %>% 
    gsub("\\.", "", .) -> cols

  question <- cols[1]
  cols[1] <- "text"

  setNames(df, cols) %>% 
    mutate(question=question) %>% 
    mutate(n=1:nrow(.)) %>% 
    select(question, n, text, freq, percent, cum) %>%
    mutate(percent=percent/100, cum=cum/100)

})
##    question n            text freq percent    cum
## 1        q1 1          answer   35  0.2108 0.2108
## 2        q1 2            text    4  0.0241 0.2349
## 3        q1 3           words   35  0.2108 0.4450
## 4        q1 4       something   38  0.2289 0.6747
## 5        q1 5            blah   54  0.3253 1.0000
## 6        q2 1             foo    1  0.0060 0.0060
## 7        q2 2        blahblah   11  0.0663 0.0723
## 8        q2 3             etc   26  0.1566 0.2289
## 9        q2 4       more text   82  0.4940 0.7229
## 10       q2 5          answer    7  0.0422 0.7651
## 11       q2 6 survey response   39  0.2349 1.0000
## 12       q3 1          option    7  0.0422 0.0422
## 13       q3 2            text   24  0.1446 0.1867
## 14       q3 3           blahb   25  0.1506 0.3373
## 15       q3 4       more text   82  0.4940 0.8313
## 16       q3 5             etc   28  0.1687 1.0000
