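I have some space-delimited numeric data. I tried reading it into R with read.table, but in some rows the space separator is missing, so many values are stuck together. How can I read the data correctly? I tried changing some read.table arguments, but that was not enough.
The raw data is here: https://dl.dropboxusercontent.com/u/74190377/data.txt
Sample data (dput output of what read.table gave me):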
structure(list(id = c("60019660101", "60019660102", "60019660103",
"60019660104", "60019660105", "60019660106", "60019660107", "60019660108",
"60019660109", "60019660110", "60019660111", "60019660112", "60019660113",
"60019660114", "60019660115", "60019660116", "60019660117", "60019660118",
"60019660119-10.6-12.4-11.9-11.6"), name1 = c("4.3", "7.4", "5.8",
"4.3", "-3.5-12.9", "-6.6-13.3", "-5.7", "-5.0-11.4", "-7.5-12.0",
"-8.8-15.3-11.5-19.5", "-9.8-16.4-13.1-22.3", "-8.9-17.4-10.9-20.0",
"-7.3", "-5.8-10.5", "-5.4-13.6", "-9.4-20.4-14.4-26.3", "-7.9-15.6-10.3-19.4",
"-8.7-11.2-10.5-16.0", "1.3"), name2 = c(".7", "3.8", "3.0",
"-4.1", "-8.6", "-8.6-16.3", "-7.5", "-8.9-11.0", "-9.6-17.6",
".0", ".6", "2.4", "-9.2", "-6.9", "-8.3", ".0", "1.2", ".8",
"34-99.0"), name3 = c("3.4", "5.5", "4.2", "-1.9", "-5.6", "6.1",
"-6.6", "1.8", "1.6", "20-99.0", "18", "17-99.0", "-8.5", "-8.0",
"-9.1", "33", "33-99.0", "34-99.0", "-.9"), name4 = c("1.0",
"1.9", "1.8", "-2.4", "1.5", "21-99.0", "-7.9", "25-99.0", "27-99.0",
"-.9", "1.5", "-.9", "-9.1", "6.1", ".1", "4.6", "-.9", "-.9",
"-.9"), name5 = c("1.0", "1.6", "10.9", "7.2", "17-99.0", "-.9",
"1.0", "-.9", "-.9", "-.9", "-.9", "-.9", "2.4", "25-99.0", "33-99.0",
"-.9", "-.9", "-.9", "-.9"), name6 = c("-9", "-9", "-9", "7-99.0",
"-.9", "-.9", "27-99.0", "-.9", "-.9", "-.9", "-.9", "-.9", "20-99.0",
"-.9", "-.9", "-.9", "-.9", "-.9", "-.9"), name7 = c(3.1, 3.7,
2.7, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9,
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9), name8 = c(-0.9, -0.9, -0.9,
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9,
-0.9, -0.9, -0.9, -0.9, NA), name9 = c(-0.9, -0.9, -0.9, -0.9,
-0.9, -0.9, -0.9, -0.9, -0.9, NA, -0.9, NA, -0.9, -0.9, -0.9,
-0.9, NA, NA, NA), name10 = c(-0.9, -0.9, -0.9, -0.9, -0.9, NA,
-0.9, NA, NA, NA, NA, NA, -0.9, -0.9, -0.9, NA, NA, NA, NA),
name11 = c(9.6, 7.8, 9, -0.9, NA, NA, -0.9, NA, NA, NA, NA,
NA, -0.9, NA, NA, NA, NA, NA, NA), name12 = c(-0.9, -0.9,
-0.9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA)), .Names = c("id", "name1", "name2", "name3",
"name4", "name5", "name6", "name7", "name8", "name9", "name10",
"name11", "name12"), class = "data.frame", row.names = c(NA,
-19L))
Here is my (bad) output:
id name1 name2 name3 name4 name5 name6 name7 name8 name9 name10 name11 name12
1 60019660101 4.3 .7 3.4 1.0 1.0 -9 3.1 -0.9 -0.9 -0.9 9.6 -0.9
2 60019660102 7.4 3.8 5.5 1.9 1.6 -9 3.7 -0.9 -0.9 -0.9 7.8 -0.9
3 60019660103 5.8 3.0 4.2 1.8 10.9 -9 2.7 -0.9 -0.9 -0.9 9.0 -0.9
4 60019660104 4.3 -4.1 -1.9 -2.4 7.2 7-99.0 -0.9 -0.9 -0.9 -0.9 -0.9 NA
5 60019660105 -3.5-12.9 -8.6 -5.6 1.5 17-99.0 -.9 -0.9 -0.9 -0.9 -0.9 NA NA
6 60019660106 -6.6-13.3 -8.6-16.3 6.1 21-99.0 -.9 -.9 -0.9 -0.9 -0.9 NA NA NA
7 60019660107 -5.7 -7.5 -6.6 -7.9 1.0 27-99.0 -0.9 -0.9 -0.9 -0.9 -0.9 NA
8 60019660108 -5.0-11.4 -8.9-11.0 1.8 25-99.0 -.9 -.9 -0.9 -0.9 -0.9 NA NA NA
9 60019660109 -7.5-12.0 -9.6-17.6 1.6 27-99.0 -.9 -.9 -0.9 -0.9 -0.9 NA NA NA
10 60019660110 -8.8-15.3-11.5-19.5 .0 20-99.0 -.9 -.9 -.9 -0.9 -0.9 NA NA NA NA
11 60019660111 -9.8-16.4-13.1-22.3 .6 18 1.5 -.9 -.9 -0.9 -0.9 -0.9 NA NA NA
12 60019660112 -8.9-17.4-10.9-20.0 2.4 17-99.0 -.9 -.9 -.9 -0.9 -0.9 NA NA NA NA
13 60019660113 -7.3 -9.2 -8.5 -9.1 2.4 20-99.0 -0.9 -0.9 -0.9 -0.9 -0.9 NA
14 60019660114 -5.8-10.5 -6.9 -8.0 6.1 25-99.0 -.9 -0.9 -0.9 -0.9 -0.9 NA NA
15 60019660115 -5.4-13.6 -8.3 -9.1 .1 33-99.0 -.9 -0.9 -0.9 -0.9 -0.9 NA NA
16 60019660116 -9.4-20.4-14.4-26.3 .0 33 4.6 -.9 -.9 -0.9 -0.9 -0.9 NA NA NA
17 60019660117 -7.9-15.6-10.3-19.4 1.2 33-99.0 -.9 -.9 -.9 -0.9 -0.9 NA NA NA NA
18 60019660118 -8.7-11.2-10.5-16.0 .8 34-99.0 -.9 -.9 -.9 -0.9 -0.9 NA NA NA NA
19 60019660119-10.6-12.4-11.9-11.6 1.3 34-99.0 -.9 -.9 -.9 -.9 -0.9 NA NA NA NA NA
This is what the correct data should look like:
60019660101 4.3 .7 3.4 1.0 1.0 -9 3.1 -.9 -.9 -.9 9.6 -.9
60019660102 7.4 3.8 5.5 1.9 1.6 -9 3.7 -.9 -.9 -.9 7.8 -.9
60019660103 5.8 3.0 4.2 1.8 10.9 -9 2.7 -.9 -.9 -.9 9.0 -.9
60019660104 4.3 -4.1 -1.9 -2.4 7.2 7 -99.0 -.9 -.9 -.9 -.9 -.9
60019660105 -3.5 -12.9 -8.6 -5.6 1.5 17 -99.0 -.9 -.9 -.9 -.9 -.9
60019660106 -6.6 -13.3 -8.6 -16.3 6.1 21 -99.0 -.9 -.9 -.9 -.9 -.9
60019660107 -5.7 -7.5 -6.6 -7.9 1.0 27 -99.0 -.9 -.9 -.9 -.9 -.9
60019660108 -5.0 -11.4 -8.9 -11.0 1.8 25 -99.0 -.9 -.9 -.9 -.9 -.9
60019660109 -7.5 -12.0 -9.6 -17.6 1.6 27 -99.0 -.9 -.9 -.9 -.9 -.9
60019660110 -8.8 -15.3 -11.5 -19.5 .0 20 -99.0 -.9 -.9 -.9 -.9 -.9
60019660111 -9.8 -16.4 -13.1 -22.3 .6 18 1.5 -.9 -.9 -.9 -.9 -.9
60019660112 -8.9 -17.4 -10.9 -20.0 2.4 17 -99.0 -.9 -.9 -.9 -.9 -.9
60019660113 -7.3 -9.2 -8.5 -9.1 2.4 20 -99.0 -.9 -.9 -.9 -.9 -.9
60019660114 -5.8 -10.5 -6.9 -8.0 6.1 25 -99.0 -.9 -.9 -.9 -.9 -.9
60019660115 -5.4 -13.6 -8.3 -9.1 .1 33 -99.0 -.9 -.9 -.9 -.9 -.9
60019660116 -9.4 -20.4 -14.4 -26.3 .0 33 4.6 -.9 -.9 -.9 -.9 -.9
60019660117 -7.9 -15.6 -10.3 -19.4 1.2 33 -99.0 -.9 -.9 -.9 -.9 -.9
60019660118 -8.7 -11.2 -10.5 -16.0 .8 34 -99.0 -.9 -.9 -.9 -.9 -.9
60019660119 -10.6 -12.4 -11.9 -11.6 1.3 34 -99.0 -.9 -.9 -.9 -.9 -.9
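Answer 0 (score: 5)
Your data appear to be in fixed-width format, so read.fwf with explicit column widths can split the fields even where no space separates them.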
read.fwf("https://dl.dropboxusercontent.com/u/74190377/data.txt",
widths=c(13,5,5,5,5,7,4,5,5,5,5,5,5))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#1 60019660101 4.3 0.7 3.4 1.0 1.0 -9 3.1 -0.9 -0.9 -0.9 9.6 -0.9
#2 60019660102 7.4 3.8 5.5 1.9 1.6 -9 3.7 -0.9 -0.9 -0.9 7.8 -0.9
#3 60019660103 5.8 3.0 4.2 1.8 10.9 -9 2.7 -0.9 -0.9 -0.9 9.0 -0.9
#4 60019660104 4.3 -4.1 -1.9 -2.4 7.2 7 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9
#5 60019660105 -3.5 -12.9 -8.6 -5.6 1.5 17 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9
<snip>
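As a self-contained illustration of the same idea, here is a minimal sketch that runs without the remote file. The three input lines and the widths c(11, 5, 5) are made up for this example and do not describe the layout of the real file; the column names id, v1 and v2 are likewise hypothetical. It also passes colClasses so that the long id is kept as text rather than converted to a number.

```r
# Toy fixed-width lines (hypothetical layout: 11-char id, then two 5-char fields).
toy_lines <- c(
  "60019660101  4.3   .7",
  "60019660105 -3.5-12.9",
  "60019660119-10.6-12.4"
)

# read.fwf cuts each line at the given widths, so fields that touch
# each other (e.g. "-3.5-12.9") are still separated correctly.
fw <- read.fwf(
  textConnection(toy_lines),
  widths     = c(11, 5, 5),
  col.names  = c("id", "v1", "v2"),
  colClasses = c("character", "numeric", "numeric")
)

print(fw)
```

For a real file, the widths have to be worked out from the file itself, for example by looking at a row where every field is at its widest.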
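Answer 1 (score: 1)
I would fix the raw file with a find-and-replace. Any editor (even MS Word!) can do this:
Find: "-"
Replace with: "(space or tab)-"
Replace all
After that, read.table should work fine, because every minus sign now starts a new whitespace-separated field.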
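The same replacement can also be done inside R without editing the file by hand. The following is a minimal sketch, assuming that a minus sign only ever appears as the leading sign of a value (it would break on scientific notation such as 1e-05). The three input lines are modeled on the sample rows above, and the variable names are illustrative; for the real file, raw_lines would come from readLines() on the file path or URL.

```r
# Lines modeled on the sample data, with some separators missing.
raw_lines <- c(
  "60019660104 4.3 -4.1 -1.9 -2.4 7.2 7-99.0 -.9 -.9 -.9 -.9 -.9",
  "60019660105 -3.5-12.9 -8.6 -5.6 1.5 17-99.0 -.9 -.9 -.9 -.9 -.9",
  "60019660119-10.6-12.4-11.9-11.6 1.3 34-99.0 -.9 -.9 -.9 -.9 -.9"
)

# Put a space in front of every minus sign. Extra spaces are harmless,
# because read.table's default separator is any run of whitespace.
fixed_lines <- gsub("-", " -", raw_lines, fixed = TRUE)

# Keep the id as character so the long number is not converted to a double.
fixed <- read.table(
  text       = fixed_lines,
  colClasses = c("character", rep("numeric", 12))
)

print(fixed)
```

Unlike the fixed-width approach, this does not depend on the columns being aligned, but it does depend on no field legitimately containing an embedded minus sign.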