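Read data in R only when a line has three columns

Time: 2013-07-03 21:06:12

Tags: r

I have a file that contains a lot of data mixed with text. I want to read the file so that I keep only the lines with three coordinates, that is, lines in the format 490353.36, 3755632.81, 109.73. In other words, I want to keep the data that follows each SURFACE LINE: marker. The data holds x, y, and z coordinates at different cross-sections.

Sample data is as follows: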

ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.552* 
    NODE NAME:                
    CUT LINE:
      490348.236792825 , 3755554.44864345 
      490248.581497463 , 3755561.99219479 
      490223.87626427 , 3755563.8637565 
      490199.171038808 , 3755565.73531763 
      490122.732478269 , 3755571.5258566 
    SURFACE LINE:
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73
  END:

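I have thousands of lines like the ones shown above. I want to collect only the data with three comma-separated columns and save it as a data frame in R.

The desired sample output for the dataset above is shown below. The commas should also be removed.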

     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73

4 answers:

Answer 0 (score: 3)

First read the text file with readLines; I would do something like this:

tt <- readLines("myfile.txt")
pat <- "^[ ]*(.*),(.*),(.*)[ ]*$"
tt <- gsub(pat, "\\1,\\2,\\3", grep(pat, tt, value=TRUE))
dat <- read.table(textConnection(tt), sep=",", header=FALSE)

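The idea: first we read the whole file into tt so that we can make all the required changes, filter the lines we need, and so on. Then we need to decide which lines to keep and which to discard. For that we build a pattern: zero or more spaces, followed by anything, a comma, anything, a comma, anything, and finally zero or more spaces. This gives you the lines with 3 columns separated by , (because .* can also match a comma, the pattern strictly requires at least two commas, which is enough for the sample data). So we first use pat with grep to filter the lines, keeping only those that match the pattern (using value=TRUE). Then we use gsub to remove the spaces and keep what lies between the commas (not strictly necessary, I think, but it certainly does no harm). At that point we have the data we need. All that remains is to pass it to textConnection and read it with read.table as usual. Hope this helps.

The code is already split into separate lines. Just run them one at a time and look at the output, and you should understand it right away.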
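The filtering idea can be tried on a handful of lines without touching a file. The following is only a sketch and not part of the original answer: the inline `sample_lines` vector, the stricter `num_pat` pattern (which accepts exactly three numeric fields), and the column names x, y, z are all assumptions introduced for illustration.

```r
# Hypothetical sample standing in for readLines("myfile.txt")
sample_lines <- c(
  "    CUT LINE:",
  "      490353.358391478 , 3755632.80772044 ",
  "    SURFACE LINE:",
  "     490353.36,   3755632.81,   109.73",
  "     490341.00,   3755633.74,   103.63",
  "  END:"
)

# Assumed stricter pattern: exactly three comma-separated numeric fields
num <- "[-+]?[0-9]+\\.?[0-9]*"
sp  <- "[[:space:]]*"
num_pat <- paste0("^", sp, num, sp, ",", sp, num, sp, ",", sp, num, sp, "$")

# Keep only the matching lines, then parse them as comma-separated numbers
kept <- grep(num_pat, sample_lines, value = TRUE)
dat <- read.table(text = kept, sep = ",", header = FALSE,
                  col.names = c("x", "y", "z"))
```

Because read.table strips whitespace around numeric fields, the gsub clean-up step is not needed in this variant, and the two-column CUT LINE rows are rejected by the pattern.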

Answer 1 (score: 3)

This is so ugly that I almost didn't post it. But it works. I read your data in like this:

raw<-read.table(textConnection('ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.552* 
    NODE NAME:                
    CUT LINE:
      490348.236792825 , 3755554.44864345 
      490248.581497463 , 3755561.99219479 
      490223.87626427 , 3755563.8637565 
      490199.171038808 , 3755565.73531763 
      490122.732478269 , 3755571.5258566 
    SURFACE LINE:
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73
  END:'),sep='\n',stringsAsFactors=FALSE)

Then I turned it into a data.frame:

vec<-unlist(raw)

start<-grep('SURFACE LINE:',vec)+1
end<-grep('END:',vec)-1

data<-do.call(rbind,
lapply(seq_along(start), 
  # sep="," splits on the commas so they do not end up inside the values
  function(x) read.table(textConnection(vec[start[x]:end[x]]), sep=","))
)
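The start/end indexing approach can be checked on a small inline vector. This is a hedged sketch rather than the answerer's code: the `vec` contents, the names `first`, `last`, and `surface`, and the extra `section` column recording which cross-section a row came from are assumptions added here.

```r
# Hypothetical lines standing in for the contents of the real file
vec <- c(
  "  CROSS-SECTION:",
  "    SURFACE LINE:",
  "     490353.36,   3755632.81,   109.73",
  "     490341.00,   3755633.74,   103.63",
  "  END:",
  "  CROSS-SECTION:",
  "    SURFACE LINE:",
  "     490348.24,   3755554.45,   109.73",
  "  END:"
)

# Each block starts one line after "SURFACE LINE:" and ends one line before "END:"
first <- grep("SURFACE LINE:", vec, fixed = TRUE) + 1
last  <- grep("END:", vec, fixed = TRUE) - 1

blocks <- lapply(seq_along(first), function(i) {
  block <- read.table(text = vec[first[i]:last[i]], sep = ",",
                      col.names = c("x", "y", "z"))
  block$section <- i  # assumed extra column: index of the cross-section
  block
})
surface <- do.call(rbind, blocks)
```

This relies on every SURFACE LINE: marker being followed by its own END: line, as in the sample data.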

Answer 2 (score: 2)

Not the shortest, but easier for me to understand:

raw_text <- "ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:"

Here are the steps:

## read the data
raw_data <- readLines(textConnection(raw_text))

## split by ","
split_list <- strsplit(raw_data, ",")

## check for 3 columns
data <- split_list[sapply(split_list, length) == 3]

## remove space and ","
data <- lapply(data, function(x) gsub("\\s+|\\,", "", x))

## bind the data 
do.call("rbind", data)


##       [,1]        [,2]         [,3]    
##  [1,] "490353.36" "3755632.81" "109.73"
##  [2,] "490341.00" "3755633.74" "103.63"
##  [3,] "490331.74" "3755634.44" "97.54" 
##  [4,] "490276.13" "3755638.65" "91.44" 
##  [5,] "490263.78" "3755639.58" "85.34" 
##  [6,] "490254.51" "3755640.28" "79.25" 
##  [7,] "490254.51" "3755640.28" "79.25" 
##  [8,] "490242.16" "3755641.22" "75.59" 
##  [9,] "490229.80" "3755642.15" "75.59" 
## [10,] "490217.44" "3755643.08" "75.59" 
## [11,] "490205.09" "3755644.02" "79.25" 
## [12,] "490205.09" "3755644.02" "79.25" 
## [13,] "490186.55" "3755645.42" "85.34" 
## [14,] "490177.29" "3755646.12" "91.44" 
## [15,] "490158.75" "3755647.52" "97.54" 
## [16,] "490146.40" "3755648.45" "103.63"
## [17,] "490130.95" "3755649.62" "109.73"
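The result printed above is a character matrix, while the question asks for a data frame. One possible follow-up step, shown here as a sketch with a two-row stand-in matrix `m` and assumed column names x, y, z, is to convert each column to numeric:

```r
# A small character matrix shaped like the printed result above
m <- rbind(c("490353.36", "3755632.81", "109.73"),
           c("490341.00", "3755633.74", "103.63"))

# Turn the matrix into a data frame, then convert every column to numeric
df <- as.data.frame(m, stringsAsFactors = FALSE)
df[] <- lapply(df, as.numeric)
names(df) <- c("x", "y", "z")  # assumed column names
```

Assigning into `df[]` keeps the data frame structure while replacing the contents of each column.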

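Answer 3 (score: 0)

I'd like to suggest another approach. As @dickoa pointed out, if you are a Linux or Mac user, you can use an external program such as awk or egrep to do the filtering for you. There is no need to do the filtering by hand outside of R; you can do it with a single system call. Both of the following work:

Using awk, as suggested by @dickoa: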

# -F, sets the field separator before the first line is read
read.table(text = system("awk -F, 'NF == 3' test.txt",
                         intern = TRUE),
           sep = ',')

Using egrep:

read.table(text = system("egrep '^[^,]+,[^,]+,[^,]+$' test.txt", intern = TRUE),
           sep = ',')

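The advantage is that this does not force R to read all the data into memory, which can make a difference if you are reading from a very large file. It is also shorter than the other suggested answers.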
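Where awk or egrep are not available (for example on Windows), a similar memory saving can be approximated in plain R by reading the file in chunks and discarding non-matching lines as you go. This is only a sketch: `read_three_col` and `chunk_size` are hypothetical names introduced here, and the kept lines are still accumulated in memory before parsing.

```r
# Hypothetical helper: keeps only lines with exactly three comma-separated
# fields, reading the file chunk_size lines at a time
read_three_col <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    kept <- c(kept, grep("^[^,]+,[^,]+,[^,]+$", chunk, value = TRUE))
  }
  read.table(text = kept, sep = ",", header = FALSE)
}

# Usage with a small temporary file standing in for the real data
tmp <- tempfile(fileext = ".txt")
writeLines(c("    SURFACE LINE:",
             "     490353.36,   3755632.81,   109.73",
             "  END:",
             "      490353.358391478 , 3755632.80772044 "), tmp)
result <- read_three_col(tmp, chunk_size = 2L)
```

The pattern is the same one used in the egrep call, so both approaches should select the same lines. Note that read.table raises an error if no line matches.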