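I have a file that contains a large amount of data and text. I want to read the file so that I keep only the lines with three coordinates, that is, lines in the format 490353.36, 3755632.81, 109.73.
In other words, I want to keep the data that follows each SURFACE LINE: header. The data hold x, y and z coordinates at different cross-sections.
Sample data: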
ENDSTREAMNETWORK:
BEGIN CROSS-SECTIONS:
CROSS-SECTION:
STREAM ID:Sipsey Fork
REACH ID:Sipsey Fork
STATION:13.60
NODE NAME:
CUT LINE:
490353.358391478 , 3755632.80772044
490254.511677942 , 3755640.28160111
490229.8 , 3755642.15
490205.088314326 , 3755644.01839947
490130.953109393 , 3755649.62143546
SURFACE LINE:
490353.36, 3755632.81, 109.73
490341.00, 3755633.74, 103.63
490331.74, 3755634.44, 97.54
490276.13, 3755638.65, 91.44
490263.78, 3755639.58, 85.34
490254.51, 3755640.28, 79.25
490254.51, 3755640.28, 79.25
490242.16, 3755641.22, 75.59
490229.80, 3755642.15, 75.59
490217.44, 3755643.08, 75.59
490205.09, 3755644.02, 79.25
490205.09, 3755644.02, 79.25
490186.55, 3755645.42, 85.34
490177.29, 3755646.12, 91.44
490158.75, 3755647.52, 97.54
490146.40, 3755648.45, 103.63
490130.95, 3755649.62, 109.73
END:
CROSS-SECTION:
STREAM ID:Sipsey Fork
REACH ID:Sipsey Fork
STATION:13.552*
NODE NAME:
CUT LINE:
490348.236792825 , 3755554.44864345
490248.581497463 , 3755561.99219479
490223.87626427 , 3755563.8637565
490199.171038808 , 3755565.73531763
490122.732478269 , 3755571.5258566
SURFACE LINE:
490348.24, 3755554.45, 109.73
490335.78, 3755555.39, 103.68
490332.73, 3755555.62, 101.72
490326.44, 3755556.10, 97.65
490321.09, 3755556.50, 96.98
490279.74, 3755559.63, 92.42
490270.38, 3755560.34, 91.35
490262.42, 3755560.94, 87.53
490258.64, 3755561.23, 85.56
490257.92, 3755561.29, 85.22
490253.65, 3755561.61, 82.50
490248.58, 3755561.99, 79.27
490248.58, 3755561.99, 79.27
490245.75, 3755562.21, 78.40
490243.64, 3755562.37, 77.73
490236.08, 3755562.94, 75.58
490223.88, 3755563.86, 75.58
490212.36, 3755564.74, 75.58
490209.15, 3755564.98, 76.44
490206.21, 3755565.20, 77.24
490200.50, 3755565.63, 78.84
490199.17, 3755565.74, 79.26
490199.17, 3755565.74, 79.26
490197.66, 3755565.85, 79.78
490193.00, 3755566.20, 81.22
490186.72, 3755566.68, 83.20
490182.06, 3755567.03, 84.83
490180.06, 3755567.18, 85.47
490170.51, 3755567.91, 91.44
490170.23, 3755567.93, 91.52
490151.40, 3755569.35, 97.45
490141.55, 3755570.10, 102.06
490138.66, 3755570.32, 103.48
490133.49, 3755570.71, 105.53
490122.73, 3755571.53, 109.73
END:
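I have thousands of lines like the ones above. I just want to compile all the data into three comma-separated columns and save it as a data frame in R.
The desired output for the dataset above is shown below. The commas should also be removed.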
490353.36, 3755632.81, 109.73
490341.00, 3755633.74, 103.63
490331.74, 3755634.44, 97.54
490276.13, 3755638.65, 91.44
490263.78, 3755639.58, 85.34
490254.51, 3755640.28, 79.25
490254.51, 3755640.28, 79.25
490242.16, 3755641.22, 75.59
490229.80, 3755642.15, 75.59
490217.44, 3755643.08, 75.59
490205.09, 3755644.02, 79.25
490205.09, 3755644.02, 79.25
490186.55, 3755645.42, 85.34
490177.29, 3755646.12, 91.44
490158.75, 3755647.52, 97.54
490146.40, 3755648.45, 103.63
490130.95, 3755649.62, 109.73
490348.24, 3755554.45, 109.73
490335.78, 3755555.39, 103.68
490332.73, 3755555.62, 101.72
490326.44, 3755556.10, 97.65
490321.09, 3755556.50, 96.98
490279.74, 3755559.63, 92.42
490270.38, 3755560.34, 91.35
490262.42, 3755560.94, 87.53
490258.64, 3755561.23, 85.56
490257.92, 3755561.29, 85.22
490253.65, 3755561.61, 82.50
490248.58, 3755561.99, 79.27
490248.58, 3755561.99, 79.27
490245.75, 3755562.21, 78.40
490243.64, 3755562.37, 77.73
490236.08, 3755562.94, 75.58
490223.88, 3755563.86, 75.58
490212.36, 3755564.74, 75.58
490209.15, 3755564.98, 76.44
490206.21, 3755565.20, 77.24
490200.50, 3755565.63, 78.84
490199.17, 3755565.74, 79.26
490199.17, 3755565.74, 79.26
490197.66, 3755565.85, 79.78
490193.00, 3755566.20, 81.22
490186.72, 3755566.68, 83.20
490182.06, 3755567.03, 84.83
490180.06, 3755567.18, 85.47
490170.51, 3755567.91, 91.44
490170.23, 3755567.93, 91.52
490151.40, 3755569.35, 97.45
490141.55, 3755570.10, 102.06
490138.66, 3755570.32, 103.48
490133.49, 3755570.71, 105.53
490122.73, 3755571.53, 109.73
Answer 0 (score: 3)
I would first read the text file with readLines and then do something like this:
tt <- readLines("myfile.txt")
pat <- "^[ ]*(.*),(.*),(.*)[ ]*$"
tt <- gsub(pat, "\\1,\\2,\\3", grep(pat, tt, value=TRUE))
dat <- read.table(textConnection(tt), sep=",", header=FALSE)
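The idea: first we read the whole file into tt so that we can make all the required changes, filter the lines we need, and so on. Then we have to decide which lines to keep and which to discard. For that we build a pattern: zero or more spaces, followed by anything, followed by a comma, followed by anything, followed by a comma, followed by anything, followed by zero or more spaces. This matches any line with at least two commas, which in this file means exactly the lines with 3 comma-separated columns.
So we first use pat with grep to filter the lines and keep only those that match the pattern (using value=TRUE). Then we use gsub to strip the leading spaces and keep the comma-separated fields (not strictly necessary, I think, but it certainly does no harm). At that point we have the data we need. All that is left is to pass it to textConnection and read it with read.table as usual. Hope this helps.
The code is already split into separate lines. Just run them one at a time and look at the output, and you should understand it right away.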
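Because (.*) also matches commas and arbitrary text, the pattern above keeps any line with two or more commas. If the file could contain such lines that are not coordinates, a stricter pattern is one option. The following is a minimal sketch, not part of the original answer: the names num and pat3, the column names x, y, z, and the small tt vector standing in for readLines("myfile.txt") are all assumptions made for illustration.

```r
# Hypothetical stand-in for tt <- readLines("myfile.txt")
tt <- c("CUT LINE:",
        "  490353.358391478 , 3755632.80772044",
        "SURFACE LINE:",
        "  490353.36, 3755632.81, 109.73",
        "  490341.00, 3755633.74, 103.63",
        "END:")

# One numeric field: optional sign, digits, optional decimal part
num <- "-?[0-9]+(\\.[0-9]+)?"
sp  <- "[[:space:]]*"

# Exactly three numeric fields separated by commas, with optional spaces
pat3 <- paste0("^", sp, num, sp, ",", sp, num, sp, ",", sp, num, sp, "$")

# Keep only the matching lines and parse them as comma-separated numbers
keep <- grep(pat3, tt, value = TRUE)
dat <- read.table(text = keep, sep = ",", header = FALSE,
                  col.names = c("x", "y", "z"))
```

The two-column CUT LINE rows and the header lines fail the pattern, so only the x, y, z rows reach read.table.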
Answer 1 (score: 3)
This is so ugly I almost didn't post it. But it works. I read your data in like this:
raw<-read.table(textConnection('ENDSTREAMNETWORK:
BEGIN CROSS-SECTIONS:
CROSS-SECTION:
STREAM ID:Sipsey Fork
REACH ID:Sipsey Fork
STATION:13.60
NODE NAME:
CUT LINE:
490353.358391478 , 3755632.80772044
490254.511677942 , 3755640.28160111
490229.8 , 3755642.15
490205.088314326 , 3755644.01839947
490130.953109393 , 3755649.62143546
SURFACE LINE:
490353.36, 3755632.81, 109.73
490341.00, 3755633.74, 103.63
490331.74, 3755634.44, 97.54
490276.13, 3755638.65, 91.44
490263.78, 3755639.58, 85.34
490254.51, 3755640.28, 79.25
490254.51, 3755640.28, 79.25
490242.16, 3755641.22, 75.59
490229.80, 3755642.15, 75.59
490217.44, 3755643.08, 75.59
490205.09, 3755644.02, 79.25
490205.09, 3755644.02, 79.25
490186.55, 3755645.42, 85.34
490177.29, 3755646.12, 91.44
490158.75, 3755647.52, 97.54
490146.40, 3755648.45, 103.63
490130.95, 3755649.62, 109.73
END:
CROSS-SECTION:
STREAM ID:Sipsey Fork
REACH ID:Sipsey Fork
STATION:13.552*
NODE NAME:
CUT LINE:
490348.236792825 , 3755554.44864345
490248.581497463 , 3755561.99219479
490223.87626427 , 3755563.8637565
490199.171038808 , 3755565.73531763
490122.732478269 , 3755571.5258566
SURFACE LINE:
490348.24, 3755554.45, 109.73
490335.78, 3755555.39, 103.68
490332.73, 3755555.62, 101.72
490326.44, 3755556.10, 97.65
490321.09, 3755556.50, 96.98
490279.74, 3755559.63, 92.42
490270.38, 3755560.34, 91.35
490262.42, 3755560.94, 87.53
490258.64, 3755561.23, 85.56
490257.92, 3755561.29, 85.22
490253.65, 3755561.61, 82.50
490248.58, 3755561.99, 79.27
490248.58, 3755561.99, 79.27
490245.75, 3755562.21, 78.40
490243.64, 3755562.37, 77.73
490236.08, 3755562.94, 75.58
490223.88, 3755563.86, 75.58
490212.36, 3755564.74, 75.58
490209.15, 3755564.98, 76.44
490206.21, 3755565.20, 77.24
490200.50, 3755565.63, 78.84
490199.17, 3755565.74, 79.26
490199.17, 3755565.74, 79.26
490197.66, 3755565.85, 79.78
490193.00, 3755566.20, 81.22
490186.72, 3755566.68, 83.20
490182.06, 3755567.03, 84.83
490180.06, 3755567.18, 85.47
490170.51, 3755567.91, 91.44
490170.23, 3755567.93, 91.52
490151.40, 3755569.35, 97.45
490141.55, 3755570.10, 102.06
490138.66, 3755570.32, 103.48
490133.49, 3755570.71, 105.53
490122.73, 3755571.53, 109.73
END:'),sep='\n',stringsAsFactors=FALSE)
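Then I turned it into a data.frame:
vec<-unlist(raw)
# first and last data row of each SURFACE LINE: ... END: block
start<-grep('SURFACE LINE:',vec)+1
end<-grep('END:',vec)-1
# parse each block as comma-separated values and stack the results
data<-do.call(rbind,
              lapply(seq_along(start),
                     function(x) read.table(textConnection(vec[start[x]:end[x]]),sep=','))
)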
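The start/end indexing above can also be written without a loop by flagging which lines sit between a SURFACE LINE: header and the next END:. This is only a sketch under assumptions: the short vec below is made-up data, and the names is_start, is_end, inside and the extra section column are hypothetical additions that do not appear in the answer.

```r
# Made-up lines with the same block structure as the file
vec <- c("CUT LINE:", "1 , 2",
         "SURFACE LINE:", "1, 2, 3", "4, 5, 6", "END:",
         "CROSS-SECTION:",
         "SURFACE LINE:", "7, 8, 9", "END:")

is_start <- grepl("SURFACE LINE:", vec, fixed = TRUE)
is_end   <- grepl("^[[:space:]]*END:", vec)

# A line is inside a block when more headers than END: markers precede it
inside <- (cumsum(is_start) - cumsum(is_end)) > 0 & !is_start

dat <- read.table(text = vec[inside], sep = ",", header = FALSE,
                  col.names = c("x", "y", "z"))
# Running count of headers gives the block each row came from
dat$section <- cumsum(is_start)[inside]
```

The section column records which cross-section each row belongs to, which may be useful because the rows are otherwise stacked without any separator.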
Answer 2 (score: 2)
Not the shortest, but easier for me to understand:
raw_text <- "ENDSTREAMNETWORK:
BEGIN CROSS-SECTIONS:
CROSS-SECTION:
STREAM ID:Sipsey Fork
REACH ID:Sipsey Fork
STATION:13.60
NODE NAME:
CUT LINE:
490353.358391478 , 3755632.80772044
490254.511677942 , 3755640.28160111
490229.8 , 3755642.15
490205.088314326 , 3755644.01839947
490130.953109393 , 3755649.62143546
SURFACE LINE:
490353.36, 3755632.81, 109.73
490341.00, 3755633.74, 103.63
490331.74, 3755634.44, 97.54
490276.13, 3755638.65, 91.44
490263.78, 3755639.58, 85.34
490254.51, 3755640.28, 79.25
490254.51, 3755640.28, 79.25
490242.16, 3755641.22, 75.59
490229.80, 3755642.15, 75.59
490217.44, 3755643.08, 75.59
490205.09, 3755644.02, 79.25
490205.09, 3755644.02, 79.25
490186.55, 3755645.42, 85.34
490177.29, 3755646.12, 91.44
490158.75, 3755647.52, 97.54
490146.40, 3755648.45, 103.63
490130.95, 3755649.62, 109.73
END:"
Here are the steps:
## read the data
raw_data <- readLines(textConnection(raw_text))
## split by ","
split_list <- strsplit(raw_data, ",")
## check for 3 columns
data <- split_list[sapply(split_list, length) == 3]
## remove space and ","
data <- lapply(data, function(x) gsub("\\s+|\\,", "", x))
## bind the data
do.call("rbind", data)
## [,1] [,2] [,3]
## [1,] "490353.36" "3755632.81" "109.73"
## [2,] "490341.00" "3755633.74" "103.63"
## [3,] "490331.74" "3755634.44" "97.54"
## [4,] "490276.13" "3755638.65" "91.44"
## [5,] "490263.78" "3755639.58" "85.34"
## [6,] "490254.51" "3755640.28" "79.25"
## [7,] "490254.51" "3755640.28" "79.25"
## [8,] "490242.16" "3755641.22" "75.59"
## [9,] "490229.80" "3755642.15" "75.59"
## [10,] "490217.44" "3755643.08" "75.59"
## [11,] "490205.09" "3755644.02" "79.25"
## [12,] "490205.09" "3755644.02" "79.25"
## [13,] "490186.55" "3755645.42" "85.34"
## [14,] "490177.29" "3755646.12" "91.44"
## [15,] "490158.75" "3755647.52" "97.54"
## [16,] "490146.40" "3755648.45" "103.63"
## [17,] "490130.95" "3755649.62" "109.73"
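The result above is a character matrix, while the question asks for a data frame. One possible last step is sketched below, assuming the column names x, y and z (they are not given in the thread) and using only the first two rows of the matrix as input.

```r
# First two rows of the character matrix printed above
m <- rbind(c("490353.36", "3755632.81", "109.73"),
           c("490341.00", "3755633.74", "103.63"))

# Convert the strings to doubles while keeping the matrix shape
storage.mode(m) <- "double"

# Turn the numeric matrix into a data frame with named columns
df <- as.data.frame(m)
names(df) <- c("x", "y", "z")
```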
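Answer 3 (score: 0)
I would like to suggest another approach. As @dickoa pointed out, if you are a Linux or Mac user you can use third-party programs such as awk or egrep to do the filtering for you.
There is no need to do the filtering manually outside R; you can do it with a single system call. Both of these work:
Using awk, as @dickoa suggested: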
read.table(text = system("awk -F, 'NF == 3' test.txt",
                         intern = TRUE),
           sep = ',')
Using egrep:
read.table(text = system("egrep '^[^,]+,[^,]+,[^,]+$' test.txt", intern = TRUE),
sep = ',')
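The advantage of this is that it does not force R to read all of the data into memory, which can make a difference if you are reading from a very large file. It is also shorter than the other suggested answers.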
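Where awk or egrep are not available (on Windows, for example), a similar memory saving might be had in plain R by reading the file in chunks through a connection. This is a rough sketch rather than anything proposed in the thread: the function name read_xyz_chunked, the chunk_size argument, and the column names are hypothetical, and the temporary file only stands in for test.txt.

```r
read_xyz_chunked <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Same pattern as the egrep call: three comma-separated fields
  pat <- "^[^,]+,[^,]+,[^,]+$"
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break            # end of file
    hits <- grep(pat, lines, value = TRUE)
    if (length(hits) > 0L) {
      pieces[[length(pieces) + 1L]] <-
        read.table(text = hits, sep = ",", header = FALSE,
                   col.names = c("x", "y", "z"), colClasses = "numeric")
    }
  }
  do.call(rbind, pieces)                      # NULL if nothing matched
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("SURFACE LINE:", "1.5, 2.5, 3.5", "CUT LINE:",
             "1 , 2", "4.5, 5.5, 6.5", "END:"), tmp)
dat <- read_xyz_chunked(tmp, chunk_size = 2L)
```

Only one chunk of raw lines is held at a time, so memory use is driven by the matching rows rather than by the whole file.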