How can I use grep in R to extract this part of a table from a text file?

Asked: 2018-12-12 22:08:38

Tags: r extract

I have a file, "prf003.out":

 150 lines of blah....~tables that report other things in this text file deleted.....



     Aboveground Live    Belowground                        Forest             Total    Total     Carbon
     ----------------- -----------------    Stand  -------------------------    Stand  Removed   Released
YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire
--------------------------------------------------------------------------------------------------------------
2000     15.6     15.6      6.0      0.5      0.0      4.5      2.6      0.0     29.1      0.0        0.0
2001     15.6     15.6      6.0      0.4      0.0      4.2      2.5      0.0     28.7      0.0        0.0
2002     15.6     15.6      6.0      0.4      0.0      3.9      2.5      0.0     28.4      0.0        0.0
2003     15.6     15.6      6.0      0.4      0.0      3.7      2.5      0.0     28.1      0.0        0.0
2004     15.6     15.6      6.0      0.4      0.0      3.5      2.5      0.0     27.9      0.0        0.0
2005     16.6     16.6      6.0      1.0      1.3      3.6      2.5      0.0     30.9      0.0        0.0
2006     16.6     16.6      6.0      0.9      0.8      3.8      2.4      0.0     30.6      0.0        0.0
2007     16.6     16.6      6.0      0.9      0.6      3.8      2.4      0.0     30.3      0.0        0.0
2008     16.6     16.6      6.0      0.9      0.4      3.7      2.4      0.0     30.0      0.0        0.0
2009     16.6     16.6      6.0      0.8      0.2      3.7      2.4      0.0     29.8      0.0        0.0
2010     18.1     18.1      6.3      1.2      1.0      3.8      2.4      0.0     32.8      0.0        0.0
2011     18.1     18.1      6.3      1.1      0.6      4.0      2.4      0.0     32.5      0.0        0.0
2012     18.1     18.1      6.3      1.1      0.4      3.9      2.4      0.0     32.2      0.0        0.0
2013     18.1     18.1      6.3      1.0      0.3      3.9      2.4      0.0     31.9      0.0        0.0
2014     18.1     18.1      6.3      1.0      0.2      3.8      2.4      0.0     31.7      0.0        0.0
2015     19.1     19.1      6.5      1.4      1.1      3.9      2.4      0.0     34.3      0.0        0.0
2016     19.1     19.1      6.5      1.3      0.7      4.1      2.4      0.0     34.0      0.0        0.0
2017     19.1     19.1      6.5      1.3      0.5      4.0      2.4      0.0     33.8      0.0        0.0
2018     19.1     19.1      6.5      1.2      0.3      4.0      2.4      0.0     33.5      0.0        0.0
2019     19.1     19.1      6.5      1.2      0.2      3.9      2.4      0.0     33.2      0.0        0.0
2020     19.0     19.0      6.3      1.9      1.8      4.2      2.4      0.0     35.6      0.0        0.0
2021     19.0     19.0      6.3      1.8      1.3      4.5      2.4      0.0     35.3      0.0        0.0
2022     19.0     19.0      6.3      1.7      1.0      4.6      2.4      0.0     35.0      0.0        0.0
2023     19.0     19.0      6.3      1.6      0.7      4.6      2.4      0.0     34.7      0.0        0.0
2024     19.0     19.0      6.3      1.6      0.5      4.6      2.4      0.0     34.4      0.0        0.0
2025     19.0     19.0      6.3      2.2      2.0      4.9      2.4      0.0     36.7      0.0        0.0
2026     19.0     19.0      6.3      2.1      1.3      5.3      2.4      0.0     36.4      0.0        0.0
2027     19.0     19.0      6.3      2.0      1.0      5.4      2.4      0.0     36.0      0.0        0.0
2028     19.0     19.0      6.3      1.9      0.7      5.4      2.4      0.0     35.7      0.0        0.0
2029     19.0     19.0      6.3      1.9      0.5      5.4      2.4      0.0     35.4      0.0        0.0
2030     19.4     19.4      6.5      2.2      1.4      5.6      2.4      0.0     37.5      0.0        0.0
2031     19.4     19.4      6.5      2.1      0.8      5.9      2.4      0.0     37.2      0.0        0.0
2032     19.4     19.4      6.5      2.0      0.6      5.9      2.4      0.0     36.8      0.0        0.0
2033     19.4     19.4      6.5      1.9      0.4      5.8      2.4      0.0     36.5      0.0        0.0
2034     19.4     19.4      6.5      1.9      0.3      5.7      2.4      0.0     36.1      0.0        0.0
2035     18.6     18.6      6.3      2.6      2.1      6.0      2.4      0.0     38.0      0.0        0.0
2036     18.6     18.6      6.3      2.5      1.5      6.4      2.4      0.0     37.6      0.0        0.0
2037     18.6     18.6      6.3      2.4      1.1      6.4      2.4      0.0     37.2      0.0        0.0
2038     18.6     18.6      6.3      2.3      0.8      6.5      2.4      0.0     36.9      0.0        0.0
2039     18.6     18.6      6.3      2.2      0.6      6.5      2.4      0.0     36.5      0.0        0.0
2040     19.4     19.4      6.7      2.3      1.0      6.6      2.4      0.0     38.3      0.0        0.0
2041     19.4     19.4      6.7      2.2      0.6      6.6      2.4      0.0     38.0      0.0        0.0
2042     19.4     19.4      6.7      2.1      0.5      6.5      2.4      0.0     37.6      0.0        0.0
2043     19.4     19.4      6.7      2.0      0.4      6.4      2.4      0.0     37.3      0.0        0.0
2044     19.4     19.4      6.7      2.0      0.3      6.3      2.4      0.0     36.9      0.0        0.0
2045     17.9     17.9      6.3      2.8      2.5      6.6      2.4      0.0     38.5      0.0        0.0
2046     17.9     17.9      6.3      2.7      1.8      7.0      2.4      0.0     38.1      0.0        0.0
2047     17.9     17.9      6.3      2.6      1.4      7.1      2.4      0.0     37.7      0.0        0.0
2048     17.9     17.9      6.3      2.5      1.0      7.2      2.4      0.0     37.3      0.0        0.0
2049     17.9     17.9      6.3      2.4      0.7      7.2      2.4      0.0     36.9      0.0        0.0


blah.....a few more tables
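
I am trying to extract this particular table from it. As you can see, the "blah" at the top stands for all the other tables generated in this .txt file. After this table, a number of other tables are output in the same file.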

What I am trying to do is similar to this question, but I am now stuck: Extracting Data from Text Files

Here is what I did:

library(stringr)  # provides str_extract_all

data <- readLines("prf003.out")
data
# value = FALSE returns the index of the matching line (value = TRUE would return the line's text)
cline <- grep("YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire", data, value= FALSE)
cline

# don't use str_extract, use str_extract_all
numstr <- sapply(str_extract_all(data[cline+1:51],"[0-9]"),as.numeric)
numstr

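However, the output I get is strange and is not formatted the way I want (that is, I just want a copy of the original table so that I can work with it in R):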

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
 [1,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2
 [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 [3,]    0    0    0    0    0    0    0    0    0     0     1     1     1     1     1     1     1     1     1     1     2     2     2     2     2     2     2     2     2     2     3     3     3     3     3     3     3     3     3
 [4,]    0    1    2    3    4    5    6    7    8     9     0     1     2     3     4     5     6     7     8     9     0     1     2     3     4     5     6     7     8     9     0     1     2     3     4     5     6     7     8
 [5,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 [6,]    5    5    5    5    5    6    6    6    6     6     8     8     8     8     8     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     8     8     8     8
 [7,]    6    6    6    6    6    6    6    6    6     6     1     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     4     4     4     4     4     6     6     6     6
 [8,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 [9,]    5    5    5    5    5    6    6    6    6     6     8     8     8     8     8     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     9     8     8     8     8
[10,]    6    6    6    6    6    6    6    6    6     6     1     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     4     4     4     4     4     6     6     6     6
[11,]    6    6    6    6    6    6    6    6    6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6     6
[12,]    0    0    0    0    0    0    0    0    0     0     3     3     3     3     3     5     5     5     5     5     3     3     3     3     3     3     3     3     3     3     5     5     5     5     5     3     3     3     3
[13,]    0    0    0    0    0    1    0    0    0     0     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     2     2     2     1     1     2     2     2     1     1     2     2     2     2
[14,]    5    4    4    4    4    0    9    9    9     8     2     1     1     0     0     4     3     3     2     2     9     8     7     6     6     2     1     0     9     9     2     1     0     9     9     6     5     4     3
[15,]    0    0    0    0    0    1    0    0    0     0     1     0     0     0     0     1     0     0     0     0     1     1     1     0     0     2     1     1     0     0     1     0     0     0     0     2     1     1     0
[16,]    0    0    0    0    0    3    8    6    4     2     0     6     4     3     2     1     7     5     3     2     8     3     0     7     5     0     3     0     7     5     4     8     6     4     3     1     5     1     8
[17,]    4    4    3    3    3    3    3    3    3     3     3     4     3     3     3     3     4     4     4     3     4     4     4     4     4     4     5     5     5     5     5     5     5     5     5     6     6     6     6
[18,]    5    2    9    7    5    6    8    8    7     7     8     0     9     9     8     9     1     0     0     9     2     5     6     6     6     9     3     4     4     4     6     9     9     8     7     0     4     4     5
[19,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2
[20,]    6    5    5    5    5    5    4    4    4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4
[21,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[22,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[23,]    2    2    2    2    2    3    3    3    3     2     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3     3
[24,]    9    8    8    8    7    0    0    0    0     9     2     2     2     1     1     4     4     3     3     3     5     5     5     4     4     6     6     6     5     5     7     7     6     6     6     8     7     7     6
[25,]    1    7    4    1    9    9    6    3    0     8     8     5     2     9     7     3     0     8     5     2     6     3     0     7     4     7     4     0     7     4     5     2     8     5     1     0     6     2     9
[26,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[27,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[28,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
[29,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

As you can see, it takes every single digit of every number and puts it in its own cell. I just want the original table.

1 Answer:

Answer 0 (score: 0)

Something like this:

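# txtFile holds the lines of the file, e.g. as returned by readLines()
# figure out where the headers are & where the data starts
dataHeader1 <- which(grepl("Aboveground", txtFile))  # first header line
dataHeader2 <- dataHeader1 + 2                        # the "YEAR ..." column-name line
dataStart <- dataHeader2 + 2                          # skip the dashed rule below it
# extract the data (this reads to the end of the file, so it assumes the table comes last)
txtDat <- txtFile[dataStart:length(txtFile)]
# split each row on runs of whitespace and stack the pieces into a matrix
txtDat <- do.call(rbind, strsplit(txtDat, split = "\\s{1,}", perl = TRUE))
class(txtDat) <- "numeric"  # convert the character matrix to numeric
txtDat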
# returns
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,] 2000 15.6 15.6  6.0  0.5  0.0  4.5  2.6    0  29.1     0     0
 [2,] 2001 15.6 15.6  6.0  0.4  0.0  4.2  2.5    0  28.7     0     0
 [3,] 2002 15.6 15.6  6.0  0.4  0.0  3.9  2.5    0  28.4     0     0
 [4,] 2003 15.6 15.6  6.0  0.4  0.0  3.7  2.5    0  28.1     0     0
 [5,] 2004 15.6 15.6  6.0  0.4  0.0  3.5  2.5    0  27.9     0     0
 [6,] 2005 16.6 16.6  6.0  1.0  1.3  3.6  2.5    0  30.9     0     0
 ....
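
The code above reads through to the end of the file, which works when the table is the last thing in it. If more tables follow, as in the question, one option is to stop at the first line after the header that does not start with a four-digit year. The following is only a sketch under that assumption: txtFile here is a small made-up stand-in for the real file (five columns instead of twelve), and the column names are labels I invented, not something the file provides.

```r
# Hypothetical stand-in for readLines("prf003.out"), shortened to five columns
txtFile <- c(
  "blah",
  "     Aboveground Live    Belowground",
  "     ----------------- -----------------",
  "YEAR    Total    Merch     Live     Dead",
  "-----------------------------------------",
  "2000     15.6     15.6      6.0      0.5",
  "2001     15.6     15.6      6.0      0.4",
  "",
  "blah.....a few more tables",
  "2000     99.9     99.9      9.9      9.9"
)

# Locate the column-name line; the first data row sits two lines below it
headerLine <- grep("^YEAR\\s+Total\\s+Merch", txtFile)[1]
dataStart <- headerLine + 2

# Data rows start with a four-digit year; measure the first unbroken run of them
isRow <- grepl("^[0-9]{4}\\s", txtFile)
rest <- isRow[dataStart:length(txtFile)]
firstNonRow <- match(FALSE, rest)
nRows <- if (is.na(firstNonRow)) length(rest) else firstNonRow - 1

# Let read.table split on whitespace and convert the columns to numbers
tbl <- read.table(text = txtFile[dataStart:(dataStart + nRows - 1)],
                  col.names = c("YEAR", "AboveTotal", "AboveMerch",
                                "BelowLive", "BelowDead"))
print(tbl)
```

This gives a data frame rather than a matrix, so the columns can be addressed by name (for example tbl$YEAR). For the real file, col.names would need twelve entries to match the twelve columns.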

Note that you can sharpen the regex used to locate where the data starts, e.g.:

dataHeader1 <- which(grepl("(?=.*Aboveground)(?=.*Carbon)", txtFile, perl = TRUE))
# this can be refined as far as needed
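
To illustrate what the lookahead pattern does, here is a small sketch on three made-up lines (the lines vector is hypothetical, not taken from the file). Each (?=...) group is a lookahead, so a line matches only if it contains both words, in either order.

```r
# Hypothetical sample lines: only the first contains both words
lines <- c("     Aboveground Live    Belowground    Total     Carbon",
           "Aboveground biomass summary",
           "Carbon released from fire")

# perl = TRUE is required because lookaheads are a PCRE feature
both <- grepl("(?=.*Aboveground)(?=.*Carbon)", lines, perl = TRUE)
print(both)
print(which(both))
```

Only the first line is flagged, which is why this pattern is less likely than a plain "Aboveground" to pick up an unrelated line elsewhere in the file.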

I read the data via txtFile <- readLines("Path/To/test.txt"); the raw data itself looks like this:

 [1] "asdsalkjdaskldas+"                                                                                             
 [2] "jsafhnjadfnhdjkasfafdajfbnjasbfjads.kbnjdasnfadsnf"                                                            
 [3] "45453342542542kj             ijholijfkqaef45435314"                                                            
 [4] ""                                                                                                              
 [5] "dasfjasikedfnha4454 "                                                                                          
 [6] "a"                                                                                                             
 [7] "a"                                                                                                             
 [8] "fdgfd"                                                                                                         
 [9] "\t\t6546346343"                                                                                                  
[10] ""                                                                                                              
[11] ""                                                                                                              
[12] "     Aboveground Live    Belowground                        Forest              Total    Total     Carbon"      
[13] "     ----------------- -----------------    Stand  -------------------------    Stand  Removed   Released"     
[14] "YEAR    Total    Merch     Live     Dead     Dead      DDW    Floor  Shb/Hrb   Carbon   Carbon  from Fire"     
[15] "--------------------------------------------------------------------------------------------------------------"
[16] "2000     15.6     15.6      6.0      0.5      0.0      4.5      2.6      0.0     29.1      0.0        0.0"  
...
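
As for why the attempt in the question produced one digit per cell: the pattern "[0-9]" matches a single digit at a time, so every digit becomes its own element. A minimal sketch of the difference on one data row, using base R's regmatches and gregexpr instead of stringr so that it runs without extra packages; the same patterns should behave the same way with str_extract_all, though I have not verified that here.

```r
row <- "2000     15.6     15.6      6.0      0.5"

# "[0-9]" matches one digit at a time
digits <- regmatches(row, gregexpr("[0-9]", row))[[1]]

# "[0-9]+(\\.[0-9]+)?" matches a whole number with an optional decimal part
numbers <- regmatches(row, gregexpr("[0-9]+(\\.[0-9]+)?", row))[[1]]

print(digits)
print(as.numeric(numbers))
```

The first pattern yields fourteen separate digits for this row, while the second yields the five numbers as they appear in the table.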