How do I read data with stuck-together values in R?

Time: 2013-08-05 15:28:13

Tags: r dataframe

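I have some space-delimited numeric data. I tried to read it into R with read.table, but in some rows the space separator is missing, so several values end up stuck together in a single field. How can I read the data correctly? I tried changing some read.table arguments, but that was not enough.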

The raw data is here: https://dl.dropboxusercontent.com/u/74190377/data.txt

Sample data (as read by read.table):

structure(list(id = c("60019660101", "60019660102", "60019660103", 
"60019660104", "60019660105", "60019660106", "60019660107", "60019660108", 
"60019660109", "60019660110", "60019660111", "60019660112", "60019660113", 
"60019660114", "60019660115", "60019660116", "60019660117", "60019660118", 
"60019660119-10.6-12.4-11.9-11.6"), name1 = c("4.3", "7.4", "5.8", 
"4.3", "-3.5-12.9", "-6.6-13.3", "-5.7", "-5.0-11.4", "-7.5-12.0", 
"-8.8-15.3-11.5-19.5", "-9.8-16.4-13.1-22.3", "-8.9-17.4-10.9-20.0", 
"-7.3", "-5.8-10.5", "-5.4-13.6", "-9.4-20.4-14.4-26.3", "-7.9-15.6-10.3-19.4", 
"-8.7-11.2-10.5-16.0", "1.3"), name2 = c(".7", "3.8", "3.0", 
"-4.1", "-8.6", "-8.6-16.3", "-7.5", "-8.9-11.0", "-9.6-17.6", 
".0", ".6", "2.4", "-9.2", "-6.9", "-8.3", ".0", "1.2", ".8", 
"34-99.0"), name3 = c("3.4", "5.5", "4.2", "-1.9", "-5.6", "6.1", 
"-6.6", "1.8", "1.6", "20-99.0", "18", "17-99.0", "-8.5", "-8.0", 
"-9.1", "33", "33-99.0", "34-99.0", "-.9"), name4 = c("1.0", 
"1.9", "1.8", "-2.4", "1.5", "21-99.0", "-7.9", "25-99.0", "27-99.0", 
"-.9", "1.5", "-.9", "-9.1", "6.1", ".1", "4.6", "-.9", "-.9", 
"-.9"), name5 = c("1.0", "1.6", "10.9", "7.2", "17-99.0", "-.9", 
"1.0", "-.9", "-.9", "-.9", "-.9", "-.9", "2.4", "25-99.0", "33-99.0", 
"-.9", "-.9", "-.9", "-.9"), name6 = c("-9", "-9", "-9", "7-99.0", 
"-.9", "-.9", "27-99.0", "-.9", "-.9", "-.9", "-.9", "-.9", "20-99.0", 
"-.9", "-.9", "-.9", "-.9", "-.9", "-.9"), name7 = c(3.1, 3.7, 
2.7, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9), name8 = c(-0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, NA), name9 = c(-0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, NA, -0.9, NA, -0.9, -0.9, -0.9, 
-0.9, NA, NA, NA), name10 = c(-0.9, -0.9, -0.9, -0.9, -0.9, NA, 
-0.9, NA, NA, NA, NA, NA, -0.9, -0.9, -0.9, NA, NA, NA, NA), 
    name11 = c(9.6, 7.8, 9, -0.9, NA, NA, -0.9, NA, NA, NA, NA, 
    NA, -0.9, NA, NA, NA, NA, NA, NA), name12 = c(-0.9, -0.9, 
    -0.9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA)), .Names = c("id", "name1", "name2", "name3", 
"name4", "name5", "name6", "name7", "name8", "name9", "name10", 
"name11", "name12"), class = "data.frame", row.names = c(NA, 
-19L))

Here is my (bad) output:

                                id               name1     name2   name3   name4   name5   name6 name7 name8 name9 name10 name11 name12
1                      60019660101                 4.3        .7     3.4     1.0     1.0      -9   3.1  -0.9  -0.9   -0.9    9.6   -0.9
2                      60019660102                 7.4       3.8     5.5     1.9     1.6      -9   3.7  -0.9  -0.9   -0.9    7.8   -0.9
3                      60019660103                 5.8       3.0     4.2     1.8    10.9      -9   2.7  -0.9  -0.9   -0.9    9.0   -0.9
4                      60019660104                 4.3      -4.1    -1.9    -2.4     7.2  7-99.0  -0.9  -0.9  -0.9   -0.9   -0.9     NA
5                      60019660105           -3.5-12.9      -8.6    -5.6     1.5 17-99.0     -.9  -0.9  -0.9  -0.9   -0.9     NA     NA
6                      60019660106           -6.6-13.3 -8.6-16.3     6.1 21-99.0     -.9     -.9  -0.9  -0.9  -0.9     NA     NA     NA
7                      60019660107                -5.7      -7.5    -6.6    -7.9     1.0 27-99.0  -0.9  -0.9  -0.9   -0.9   -0.9     NA
8                      60019660108           -5.0-11.4 -8.9-11.0     1.8 25-99.0     -.9     -.9  -0.9  -0.9  -0.9     NA     NA     NA
9                      60019660109           -7.5-12.0 -9.6-17.6     1.6 27-99.0     -.9     -.9  -0.9  -0.9  -0.9     NA     NA     NA
10                     60019660110 -8.8-15.3-11.5-19.5        .0 20-99.0     -.9     -.9     -.9  -0.9  -0.9    NA     NA     NA     NA
11                     60019660111 -9.8-16.4-13.1-22.3        .6      18     1.5     -.9     -.9  -0.9  -0.9  -0.9     NA     NA     NA
12                     60019660112 -8.9-17.4-10.9-20.0       2.4 17-99.0     -.9     -.9     -.9  -0.9  -0.9    NA     NA     NA     NA
13                     60019660113                -7.3      -9.2    -8.5    -9.1     2.4 20-99.0  -0.9  -0.9  -0.9   -0.9   -0.9     NA
14                     60019660114           -5.8-10.5      -6.9    -8.0     6.1 25-99.0     -.9  -0.9  -0.9  -0.9   -0.9     NA     NA
15                     60019660115           -5.4-13.6      -8.3    -9.1      .1 33-99.0     -.9  -0.9  -0.9  -0.9   -0.9     NA     NA
16                     60019660116 -9.4-20.4-14.4-26.3        .0      33     4.6     -.9     -.9  -0.9  -0.9  -0.9     NA     NA     NA
17                     60019660117 -7.9-15.6-10.3-19.4       1.2 33-99.0     -.9     -.9     -.9  -0.9  -0.9    NA     NA     NA     NA
18                     60019660118 -8.7-11.2-10.5-16.0        .8 34-99.0     -.9     -.9     -.9  -0.9  -0.9    NA     NA     NA     NA
19 60019660119-10.6-12.4-11.9-11.6                 1.3   34-99.0     -.9     -.9     -.9     -.9  -0.9    NA    NA     NA     NA     NA

This is what the correct data should look like:

  60019660101  4.3    .7     3.4     1.0    1.0   -9     3.1    -.9  -.9  -.9  9.6  -.9
  60019660102  7.4   3.8     5.5     1.9    1.6   -9     3.7    -.9  -.9  -.9  7.8  -.9
  60019660103  5.8   3.0     4.2     1.8    10.9  -9     2.7    -.9  -.9  -.9  9.0  -.9
  60019660104  4.3  -4.1    -1.9    -2.4    7.2      7  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660105 -3.5  -12.9   -8.6    -5.6    1.5     17  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660106 -6.6  -13.3   -8.6    -16.3   6.1     21  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660107 -5.7  -7.5    -6.6    -7.9    1.0     27  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660108 -5.0  -11.4   -8.9    -11.0   1.8     25  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660109 -7.5  -12.0   -9.6    -17.6   1.6     27  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660110 -8.8  -15.3   -11.5   -19.5    .0     20  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660111 -9.8  -16.4   -13.1   -22.3    .6     18    1.5   -.9  -.9  -.9  -.9  -.9
  60019660112 -8.9  -17.4   -10.9   -20.0   2.4     17  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660113 -7.3  -9.2    -8.5    -9.1    2.4     20  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660114 -5.8  -10.5   -6.9    -8.0    6.1     25  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660115 -5.4  -13.6   -8.3    -9.1     .1     33  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660116 -9.4  -20.4   -14.4   -26.3    .0     33    4.6   -.9  -.9  -.9  -.9  -.9
  60019660117 -7.9  -15.6   -10.3   -19.4   1.2     33  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660118 -8.7  -11.2   -10.5   -16.0    .8     34  -99.0   -.9  -.9  -.9  -.9  -.9
  60019660119 -10.6 -12.4   -11.9   -11.6   1.3     34  -99.0   -.9  -.9  -.9  -.9  -.9

2 answers:

Answer 0 (score: 5)

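Your data appears to be in fixed-width format, so read.fwf can split it by column position instead of by separator: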

read.fwf("https://dl.dropboxusercontent.com/u/74190377/data.txt",
         widths=c(13,5,5,5,5,7,4,5,5,5,5,5,5))

#            V1    V2    V3    V4    V5   V6 V7    V8   V9  V10  V11  V12  V13
#1  60019660101   4.3   0.7   3.4   1.0  1.0 -9   3.1 -0.9 -0.9 -0.9  9.6 -0.9
#2  60019660102   7.4   3.8   5.5   1.9  1.6 -9   3.7 -0.9 -0.9 -0.9  7.8 -0.9
#3  60019660103   5.8   3.0   4.2   1.8 10.9 -9   2.7 -0.9 -0.9 -0.9  9.0 -0.9
#4  60019660104   4.3  -4.1  -1.9  -2.4  7.2  7 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9
#5  60019660105  -3.5 -12.9  -8.6  -5.6  1.5 17 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9
<snip>
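
As a self-contained sketch of the same idea, the snippet below writes three rows to a temporary file and reads them back with read.fwf. The rows are typed by hand to match the widths used above (taken from this answer, not checked against the full raw file), and the names `lines`, `widths`, `tmp`, and `dat` are only illustrative.

```r
# Three sample rows laid out with the field widths assumed in this answer.
lines <- c(
  "  60019660101  4.3   .7  3.4  1.0    1.0  -9  3.1  -.9  -.9  -.9  9.6  -.9",
  "  60019660105 -3.5-12.9 -8.6 -5.6    1.5  17-99.0  -.9  -.9  -.9  -.9  -.9",
  "  60019660110 -8.8-15.3-11.5-19.5     .0  20-99.0  -.9  -.9  -.9  -.9  -.9"
)

widths <- c(13, 5, 5, 5, 5, 7, 4, 5, 5, 5, 5, 5, 5)

# Sanity check: every line must be exactly as long as the widths add up to.
stopifnot(all(nchar(lines) == sum(widths)))

# read.fwf splits each line by character position, so values that touch
# (such as "-3.5-12.9") still land in separate columns.
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)
dat <- read.fwf(tmp, widths = widths)
unlink(tmp)

print(dat)
```

Because the split is positional, this only works if every row really follows the same column layout; a row with shifted columns would be cut in the wrong places.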

Answer 1 (score: 1)

I would fix the raw file with a find-and-replace. Any editor (even MS Word!) can do this:

Find: "-"

Replace with: " -" (a space or tab, followed by the minus sign)

Replace all

After that, read.table should work as expected.
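
The same replacement can also be done inside R rather than in an editor. The following is a minimal sketch on three hand-typed rows modelled on the sample data; the names `raw`, `repaired`, and `dat` are illustrative, and for the real file `raw` would presumably come from readLines on the data file.

```r
# Rows in which some negative values have lost their leading space.
raw <- c(
  "60019660101 4.3 .7 3.4 1.0 1.0 -9 3.1 -.9 -.9 -.9 9.6 -.9",
  "60019660105 -3.5-12.9 -8.6 -5.6 1.5 17-99.0 -.9 -.9 -.9 -.9 -.9",
  "60019660110 -8.8-15.3-11.5-19.5 .0 20-99.0 -.9 -.9 -.9 -.9 -.9"
)

# Number of whitespace-separated fields per row before the fix.
fields_before <- lengths(strsplit(raw, " +"))

# Put a space in front of every minus sign (literal match, not a regex).
repaired <- gsub("-", " -", raw, fixed = TRUE)

# read.table treats runs of whitespace as one separator by default,
# so the extra spaces this introduces are harmless.
dat <- read.table(text = repaired)

print(fields_before)
print(dat)
```

This relies on the only glued values being negative numbers. It would not separate two positive values that touch, and it would wrongly split any hyphen that is not a minus sign, for example in exponent notation such as 1e-5.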