I need to iterate over a huge text file (20 GB, far too large for memory) from the biological sequence database GeneGene, and extract the same information from every database entry. Each entry begins with a line of the form

LOCUS XYZ some more text

and ends with a line containing only

//
For example, suppose the file looks like this:

LOCUS 123 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
LOCUS 231 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
LOCUS 312 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//

Now, is there a way to tell Python to iteratively read the corresponding chunks of that file into some variable var? More precisely:

Iteration 1: var =
LOCUS 123 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//

Iteration 2: var =
LOCUS 231 some more text
many lines of some more text
many lines of some more text
many lines of some more text
//
Thanks in advance, and best wishes for the upcoming holidays.
Answer 0 (score: 0)
Suppose we have the following text file:
LOCUS 421 bla bla ba
Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
Duis eu erat orci. Quisque
nec augue ultricies, dignissim
neque id, feugiat risus.
//
LOCUS 421 blabla
Nullam pulvinar quis ante
at condimentum.
//
then we can do this:
is_processing = True
pf = open("somefile.txt", "r")
# Handles chunks
while True:
    first_chunk_line = True
    chunk_lines = []
    # Handles one chunk
    while True:
        data_line = pf.readline()
        # Detect the end of the file
        if data_line == '':
            is_processing = False
            break
        # Detect the first line of a chunk
        if first_chunk_line:
            if "LOCUS" not in data_line:
                raise Exception("Data file is malformed!")
            first_chunk_line = False
            continue  # don't process the LOCUS line itself
        # Detect the end of a locus / chunk
        if data_line.strip() == "//":
            break
        # If it is neither a first line, an end line, nor the end of the
        # file, it must be a chunk line holding precious DNA information
        chunk_lines.append(data_line)
    # End the outer loop once the file is exhausted
    if not is_processing:
        break
    # Do something with the lines of one chunk
    print(chunk_lines)
pf.close()
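The same chunking logic can also be packaged as a generator, which keeps the parsing in one place and lets the caller loop over entries one at a time; memory stays bounded because only one entry is held at once. This is a minimal sketch, not part of the original answer — the function name `read_locus_chunks` and the file path are placeholders. Like the loop above, it excludes the `LOCUS` header line itself from the yielded lines:

```python
def read_locus_chunks(path):
    """Yield one list of data lines per LOCUS entry in the file."""
    with open(path, "r") as pf:
        chunk_lines = None
        for line in pf:
            if line.startswith("LOCUS"):
                # Start of a new entry
                chunk_lines = []
            elif line.strip() == "//":
                # End of the current entry: hand it to the caller
                yield chunk_lines
                chunk_lines = None
            elif chunk_lines is not None:
                # A data line inside an entry
                chunk_lines.append(line)

# Usage:
# for chunk in read_locus_chunks("somefile.txt"):
#     print(chunk)
```

A design note: the generator never reads more than one line ahead, so a 20 GB file is processed with roughly one entry's worth of memory at a time.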