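How to read only certain lines of a data file into R

Time: 2016-06-20 12:50:13

Tags: r line

I have a large data file with more than 40,000 lines. It is a list of log entries that looks something like this: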

    D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0      
    D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 
    M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
    M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647 
    F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" 
    M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s

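Because it is so large, I don't want to read the whole thing into memory. I only need the lines that start with the line identifier "F" and have a (0,0) error code, like this: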

    F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"

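I can ignore everything else. My question is this: I want a way to read this file line by line and evaluate whether each input line needs to be kept. Currently, I use a for loop to iterate over every line and call the readLines() function. It looks like this: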

library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
  line <- readLines("dataSet.txt", 1)
  if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
    print(line)
    Fdata[j,] <- rbind(line)
    i <- i + 1
    j <- j + 1
  }
  i <- i + 1
}
print(Fdata)

It runs, but the output is not what I want. It just prints the first line of the file over and over.

    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"

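How can I determine whether I need a line, and how do I store it correctly (as a vector, data frame, or matrix; it doesn't really matter) so that I can print it outside the for loop?

Update

I have changed the code to: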

    library(stringr)
    con <- file("dataSet.txt", open = "r")
    Fdata <- data.frame
    i <- 1
    j <- 1
    lineLength <- length(readLines(con))
    for (i in 1:lineLength){
      line <- readLines(con, 1)
      print(line)
      if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
        print(line)
        Fdata[j,] <- rbind(line)
        i <- i + 1
        j <- j + 1
      }
      i <- i + 1
    }
    print(Fdata)

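However, when I check the value stored in line, it says it is empty. I don't understand why that changed. It also tells me that the if statement does not have a proper TRUE/FALSE condition, which confuses me too, since grepl() should return a TRUE/FALSE value.

Update

I managed to get rid of the error, but when I print Fdata I still get nothing. I checked my variables, and R says that line is empty, with no characters. Am I assigning it incorrectly? I want line to be the line I am parsing from the data file so that I can evaluate whether it needs to be stored. Here is my updated code: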

library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines("dataSet.txt"))
for (i in 1:lineLength){
  line <- readLines(con, 1)
  print(line)
  if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)){
    print(line)
    Fdata[j,] <- rbind(line)
    i <- i + 1
    j <- j + 1
  }
  i <- i + 1
}
print(Fdata) 

3 answers:

Answer 0 (score: 2)

Check this out:

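con <- file("test1.txt", "r")   # open the file once so successive reads advance through it
lines <- c()
while(TRUE) {
  line = readLines(con, 1)      # read the next single line
  if(length(line) == 0) break   # a zero-length result means end of file
  else if(grepl("^\\s*F{1}", line) && grepl("(0,0)", line, fixed = TRUE)) lines <- c(lines, line)
}
close(con)                      # release the connection when done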

lines
# [1] "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)\""

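Pass the file connection to readLines so that it can read the file line by line. The regular expression ^\\s*F{1} captures lines that start with the letter F, possibly preceded by whitespace, where ^ denotes the beginning of the string. fixed = TRUE matches (0,0) exactly, as a literal string rather than as a regular expression. If both checks are TRUE, the line is appended to lines.

Data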
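To illustrate why the connection matters, here is a minimal sketch on a small temporary file (the file contents and variable names are made up for illustration). It shows that reading by file name returns the first line every time, that an open connection advances between calls, that counting lines through a connection uses it up, and how a regex match on "(0,0)" differs from a fixed match.

```r
# Write a small three-line sample file (hypothetical content).
path <- tempfile(fileext = ".txt")
writeLines(c("C first line", "F second line (0,0)", "M third line"), path)

# Passing the file name: every call reopens the file,
# so the first line comes back each time.
by_name <- c(readLines(path, n = 1), readLines(path, n = 1))

# Passing an open connection: each call continues where the last one stopped.
con <- file(path, open = "r")
by_con <- c(readLines(con, n = 1), readLines(con, n = 1))
close(con)

# Counting lines through the same connection reads it to the end,
# so the next read returns character(0).
con <- file(path, open = "r")
n_lines <- length(readLines(con))
after_count <- readLines(con, n = 1)
close(con)

# As a regex, the parentheses only group "0,0"; with fixed = TRUE
# the pattern must appear literally, parentheses included.
as_regex <- grepl("(0,0)", "size 10,05")
as_fixed <- grepl("(0,0)", "size 10,05", fixed = TRUE)

print(by_name)
print(by_con)
print(length(after_count))
print(c(as_regex, as_fixed))
```

This probably explains the two symptoms in the question: the first version read by file name, and the later versions counted the lines through the same connection before the loop started.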

D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0      
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647 
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" 
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"

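Answer 1 (score: 1)

If you have a reasonable amount of memory, 40,000 lines should not be too much for R to handle. For performance reasons, it is better to read all the lines at once and use vectorized operations to filter the result.

Your code can be simplified to: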

library(stringr)

line <- readLines("dataSet.txt")

foundset<-line[which(str_sub(line, 1, 1) == 'F' & grepl("(0,0)", line, fixed = TRUE))]
#rm("line")  #include this line to free up memory if there is a concern

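This reads all the lines and then keeps the subset that begins with the letter "F" and contains (0,0). The matching lines are stored in the vector foundset.

Answer 2 (score: 1)

The approach from this answer (What is a good way to read line-by-line in R?) also works: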
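If memory really is a concern, a middle ground is to read the file in chunks and apply the same vectorized filter to each chunk. The sketch below is only one possible way to do that. The helper name read_matching_lines, the chunk_size parameter, and the sample lines are assumptions made for illustration, and startsWith() requires R 3.3.0 or later.

```r
# Hypothetical helper: keep only lines that start with "F" and contain the
# literal text "(0,0)", reading `chunk_size` lines at a time from `path`.
read_matching_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                 # always release the connection
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break     # end of file
    keep <- startsWith(chunk, "F") & grepl("(0,0)", chunk, fixed = TRUE)
    kept[[length(kept) + 1L]] <- chunk[keep]
  }
  as.character(unlist(kept))          # character(0) if nothing matched
}

# Small made-up sample, shortened from the log format in the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "D 20160602 14:15:43.559 F7982D62 Req Agr:131",
  "F 20160602 14:15:48.398 F7982D62 GET 50725464 (4,32) \"aborted\"",
  "F 20160602 14:25:11.321 F7982D50 GET 3322771022 (0,0) \"ok\"",
  "M 20160602 14:15:48.780 DOC1: F7982D63 New download request"
), path)

found <- read_matching_lines(path, chunk_size = 2L)
print(found)
```

Only one chunk is held in memory at a time, and each chunk is still filtered with vectorized calls rather than one line at a time.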

cat('  D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0',      
    'D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""',
    'M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction',
    'M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647',
    'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"',
    'M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s',
    'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" (0,0)',
    file="test",
    sep="\n")


library(stringr)
con  <- file("test", open = "r")
res<-c()

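# Read one line at a time until readLines() returns a zero-length vector (end of file)
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # fixed = TRUE matches the literal text "(0,0)" instead of treating the parentheses as a regex group
  if (substr(str_trim(oneLine),1,1) =="F" && (regexpr("(0,0)",oneLine, fixed = TRUE)[1] > 0) ){
    res<-c(res,oneLine)
  }
}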

close(con)
res
[1] "F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz\" \"\" 50725464 (4,32) \"Remote Application: Session Aborted: Aborted by user interrupt\" (0,0)"

Note that I added the last line there to show that the while loop works.
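The loop stops because readLines() returns a zero-length vector at the end of the file. The same fact may be behind the TRUE/FALSE complaint in the question: if() needs exactly one TRUE or FALSE, and both an NA and a zero-length condition raise an error. The following is a small sketch of those two cases. The sample string and variable names are illustrative, and the exact error seen in the question is an assumption.

```r
line <- "F 20160602 14:25:11.321 (0,0)"

# grepl() on a single string returns one logical value,
# so indexing it with [2] (as grepl(...)[i] does for i > 1) gives NA.
hit <- grepl("(0,0)", line, fixed = TRUE)
second <- hit[2]

# At end of file readLines() returns character(0);
# a comparison on it has length zero.
empty <- character(0)
cond_empty <- substr(empty, 1, 1) == "F"

# Both conditions make if() fail; tryCatch() captures the error message
# instead of stopping the script.
msg_na <- tryCatch(if (second) "kept",
                   error = function(e) conditionMessage(e))
msg_empty <- tryCatch(if (cond_empty) "kept",
                      error = function(e) conditionMessage(e))

print(second)
print(length(cond_empty))
print(msg_na)
print(msg_empty)
```

Checking length(line) == 0 before testing the line, as the while loops in the answers do, avoids both failures.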