I have a large data file with more than 40,000 lines. It is a list of log entries that looks something like this:
D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
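Because the file is so large, I don't want to read the whole thing into memory. I only need the lines that begin with the line identifier "F" and have a (0,0) error code, like this: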
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
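Everything else I can ignore. My question is this: I want a way to read this file line by line and evaluate whether each input line needs to be kept. Currently I use a for loop to go through each line with the readLines() function. It looks like this: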
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines("dataSet.txt", 1)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
It runs without errors, but the output is not what I want. It just prints the first line of the file over and over:
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
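How do I determine whether I need a line, and how do I store it properly (as a vector, data frame, or matrix, it doesn't matter which) so that I can print it outside the for loop?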
Update
I have changed my code to:
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
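However, when I inspect the value stored in line, it says it is empty. I don't understand why that changed. It also tells me that the if statement does not have a proper TRUE/FALSE condition, which confuses me as well, because grepl() should return a TRUE/FALSE value.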
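Update
I managed to get rid of that error, but when I call Fdata I still get nothing. I checked my variables, and R says line is empty and has no characters. Am I assigning it incorrectly? I want line to be the line I am parsing from the data file, so that I can evaluate whether it needs to be stored. Here is my updated code: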
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines("dataSet.txt"))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)
Answer 0 (score: 2)
Check this out:
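con <- file("test1.txt", "r")
lines <- c()
while(TRUE) {
  line <- readLines(con, 1)
  # readLines returns character(0) once the end of the file is reached
  if(length(line) == 0) break
  # keep lines that start with F (optionally after whitespace) and contain the literal text (0,0)
  else if(grepl("^\\s*F{1}", line) && grepl("(0,0)", line, fixed = TRUE)) lines <- c(lines, line)
}
close(con)
lines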
# [1] "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)\""
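Pass the file connection to readLines so that it reads one line at a time. Use the regular expression ^\\s*F{1} to match lines that start with the letter F, possibly preceded by whitespace, where ^ anchors the match to the start of the string. Use fixed = TRUE to match (0,0) literally. If both checks are TRUE, append the line to lines.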
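To see why fixed = TRUE matters here, a small sketch may help. The two sample strings below are made up for illustration (shortened log lines, not taken from the real file). Without fixed = TRUE, the parentheses in "(0,0)" are read as a regex group, so a bare 0,0 would also match.

```r
# Two hypothetical sample strings: one with the literal "(0,0)", one with a bare "0,0".
with_parens    <- 'F 20160602 14:25:11.321 F7982D50 GET 3322771022 (0,0) "1499.61 seconds"'
without_parens <- 'F 20160602 14:25:11.321 F7982D50 GET 3322771022 0,0 "1499.61 seconds"'
samples <- c(with_parens, without_parens)

# As a regular expression, the parentheses only group "0,0", so both strings match.
regex_hits <- grepl("(0,0)", samples)

# With fixed = TRUE the pattern is taken literally, so only the first string matches.
fixed_hits <- grepl("(0,0)", samples, fixed = TRUE)

# Escaping the parentheses in a regex gives the same result as fixed = TRUE.
escaped_hits <- grepl("\\(0,0\\)", samples)

print(regex_hits)
print(fixed_hits)
print(escaped_hits)
```

Either the escaped regex or fixed = TRUE should work for this log format; fixed = TRUE just avoids having to think about escaping.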
Data:
D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
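Answer 1 (score: 1)
If you have enough memory, 40,000 lines should not be too much for R to handle. For performance reasons, it is better to read all the lines at once and use vectorized operations to filter the result.
Your code can be simplified to: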
library(stringr)
line <- readLines("dataSet.txt")
foundset<-line[which(str_sub(line, 1, 1) == 'F' & grepl("(0,0)", line, fixed = TRUE))]
#rm("line") #include this line to free up memory if there is a concern
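This reads all the lines and then subsets those that start with the letter "F" and contain (0,0). All of the matching lines end up in the vector foundset.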
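If memory really is a concern, one possible middle ground is to read the file in chunks and apply the same vectorized filter to each chunk. The sketch below is only an illustration: filter_log, chunk_size and the shortened sample lines are names and data I made up, and startsWith assumes R 3.3.0 or later.

```r
# Hypothetical helper: filter a log file in chunks of `chunk_size` lines.
filter_log <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))               # close the connection even if an error occurs
  kept <- list()                    # matching lines from each chunk, combined at the end
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break   # character(0) signals the end of the file
    # vectorized test over the whole chunk
    keep <- startsWith(chunk, "F") & grepl("(0,0)", chunk, fixed = TRUE)
    kept[[length(kept) + 1L]] <- chunk[keep]
  }
  as.character(unlist(kept))        # character(0) if nothing matched
}

# Small demonstration on a temporary file with made-up, shortened log lines.
demo_path <- tempfile(fileext = ".txt")
writeLines(c(
  "D 20160602 14:15:43.559 F7982D62 Req Agr:131",
  'F 20160602 14:15:48.398 F7982D62 GET 50725464 (4,32) "Session Aborted"',
  'F 20160602 14:25:11.321 F7982D50 GET 3322771022 (0,0) "1499.61 seconds"'
), demo_path)

found <- filter_log(demo_path, chunk_size = 2L)
print(found)
```

Collecting the per-chunk results in a list and calling unlist once at the end avoids growing a character vector with c() on every match.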
Answer 2 (score: 1)
An approach like the one in this answer (What is a good way to read line-by-line in R?) also works:
cat(' D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0',
'D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""',
'M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction',
'M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647',
'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"',
'M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s',
'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" (0,0)',
file="test",
sep="\n")
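library(stringr)
con <- file("test", open = "r")
res <- c()
# read one line at a time until readLines returns character(0) at the end of the file
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # fixed = TRUE matches the literal text (0,0); as a regex the parentheses would only form a group
  if (substr(str_trim(oneLine), 1, 1) == "F" && regexpr("(0,0)", oneLine, fixed = TRUE)[1] > 0) {
    res <- c(res, oneLine)
  }
}
close(con)
res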
[1] "F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz\" \"\" 50725464 (4,32) \"Remote Application: Session Aborted: Aborted by user interrupt\" (0,0)"
Note that I added the last line to the test data to show that the while loop works.
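For what it's worth, the difference between passing a file name and passing an open connection to readLines probably explains the symptoms in the question. A minimal sketch on a made-up three-line temporary file (the contents and variable names are only for illustration):

```r
# Temporary three-line file with made-up contents.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first", "second", "third"), demo_path)

# Passing a file name: each call opens the file anew, so line 1 comes back every time.
by_name <- c(readLines(demo_path, n = 1), readLines(demo_path, n = 1))

# Passing an open connection: each call continues where the previous one stopped.
con <- file(demo_path, open = "r")
by_con <- c(readLines(con, n = 1), readLines(con, n = 1))

# Reading the rest consumes the connection; the next call returns character(0).
rest <- readLines(con)
after_end <- readLines(con, n = 1)
close(con)

print(by_name)
print(by_con)
print(rest)
print(length(after_end))
```

So readLines("dataSet.txt", 1) inside a loop keeps returning the first line, and calling length(readLines(con)) before the loop most likely leaves the connection at the end of the file, which would make every later readLines(con, 1) come back empty. An empty line in turn gives if a zero-length condition, which R reports as an error.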