Sorry, but I decided to delete the content because what I posted was incorrect.
Answer 0 (score: 0)
Here is how I would change your style:
with open("z:/file.txt", "r") as file:  # text mode; Python 3 applies universal newlines by default
    # (the old "U" flag was removed in Python 3.11); if this errors, try switching
    # back to "rb" (lines are then bytes, so compare against b'dna')
    print(file.name)
    counter = 0
    for line in file:  # iterate over the file object itself, line by line
        if line.lower().startswith('dna'):  # look for your desired condition
            # process the data
            counter += 1
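Answer 1 (score: 0)
All of the variables are kept in memory. You want to hold on to the most recent match, compare each row against it, and count while it keeps matching: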
import csv

prefix = 'DNA'
is_good = False

# newline='' is what the csv module expects for files opened in Python 3
with open('file.txt', newline='') as csvfile:
    # use csv to separate the fields, making it easier to deal with the
    # first value without hard-coding its size
    reader = csv.reader(csvfile, delimiter='\t')
    match = None
    count = 0
    for row in reader:
        # matching rows (an empty row counts as non-matching)
        if row and row[0].startswith(prefix):
            if match is None:
                # first line with prefix
                match = row[0]
            if row[0] == match:
                # found a match, so increment
                count += 1
            else:
                # row prefix has changed
                if 96 <= count < 100:
                    # counted enough, so start counting the next run
                    match = row[0]  # match on this now
                    count = 1       # the current row is the first of the new run
                else:
                    # didn't count enough, so stop working through this file
                    break
        else:
            # non-matching rows
            if match is None:
                # ignore preceding lines in file
                continue
            else:
                # found non-matching line when expecting a match
                break
    else:
        # the loop finished without a break: check the final run
        if 96 <= count < 100:
            # there was at least one successful run of lines
            is_good = True

if is_good:
    print('File was good')
else:
    print('File was bad')
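If you want to see how long each run actually is before deciding whether a file passes, the same idea can be sketched with `itertools.groupby`. This is a different technique from the loop in this answer, not a drop-in replacement: it only reports run lengths and does not stop at the first bad run. The helper name `run_lengths` and the sample data are made up for illustration.

```python
import csv
import io
from itertools import groupby


def run_lengths(rows, prefix='DNA'):
    """Return (key, length) for each run of consecutive rows.

    The key is the first field when it starts with `prefix`;
    every other row (including empty ones) is grouped under None.
    """
    def key(row):
        if row and row[0].startswith(prefix):
            return row[0]
        return None

    return [(k, sum(1 for _ in group)) for k, group in groupby(rows, key=key)]


# Hypothetical tab-separated sample, standing in for the real file.
sample = "header\nDNA1\tx\nDNA1\ty\nDNA2\tz\n"
reader = csv.reader(io.StringIO(sample), delimiter='\t')
runs = run_lengths(reader)
print(runs)

# Apply the 96..99 rule from the loop above to every prefixed run.
all_ok = all(96 <= n < 100 for k, n in runs if k is not None)
print(all_ok)
```

Printing the run lengths per file makes it easier to spot which run falls outside the 96 to 99 range when a file is unexpectedly reported as bad.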
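Answer 2 (score: 0)
From your description, the lines you are interested in match the regular expression:
^DNA[0-9]{10}
That is, I am assuming that your xyz is actually ten digits.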
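As a quick check of that assumption, here is a minimal sketch of what the pattern accepts and rejects. The sample lines and the helper name `leading_id` are invented for illustration and are not taken from your file.

```python
import re

# 'DNA' followed by exactly ten digits, anchored at the start of the line.
regex = re.compile(r'^DNA[0-9]{10}')

# Hypothetical sample lines, made up for this example.
samples = [
    "DNA0123456789\tsome\tfields",
    "DNA01234\ttoo short",
    "header line",
]


def leading_id(line):
    """Return the 13-character identifier at the start of `line`, or None."""
    m = regex.match(line)
    return m.group(0) if m else None


ids = [leading_id(s) for s in samples]
print(ids)
```

Only the first sample yields an identifier; the other two return `None` because they do not start with `DNA` plus ten digits.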
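The strategy here is to match the 13-character string. If there is no match and we have not matched before, we simply keep going. Once we get a match, we save the string and increment a counter. As long as lines keep matching both the regex and the saved string, we keep incrementing. As soon as we hit a different regex match, or no match at all, the run of identical matches is over. If the run is valid, we reset the count to zero and the last match to empty. If it is not valid, we exit.
I hasten to add that the following is untested.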
import re
import sys

# Input file with DNA lines to match:
infile = "z:/file.txt"
# This is the regex for the lines of interest:
regex = re.compile('^DNA[0-9]{10}')
# This will keep count of the number of matches in sequence:
n_seq = 0
# This is the previous match (if any):
lastmatch = ''

# Subroutine to check given sequence count and bail if bad:
def bail_on_bad_sequence(count, match):
    if 96 <= count < 100:
        return
    sys.stderr.write("Bad count (%d) for '%s'\n" % (count, match))
    sys.exit(1)

with open(infile) as file:
    for line in file:
        # Try to match the line to the regex:
        match = regex.match(line)
        if match:
            if match.group(0) == lastmatch:
                n_seq += 1
            else:
                # A different identifier: check the run that just ended (if any)
                if n_seq != 0:
                    bail_on_bad_sequence(n_seq, lastmatch)
                n_seq = 1  # the current line starts the new run
                lastmatch = match.group(0)
        else:
            # No match at all: check the run that just ended (if any)
            if n_seq != 0:
                bail_on_bad_sequence(n_seq, lastmatch)
            n_seq = 0
            lastmatch = ''

# Check the final run if the file ended in the middle of one:
if n_seq != 0:
    bail_on_bad_sequence(n_seq, lastmatch)
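Answer 3 (score: 0)
Please ignore my last request to look at the code. I checked it myself and found that the problem was the formatting. It now seems to work as expected and analyzes all of the files in the directory. Thanks again, Metthew. This helped a lot. I still have some concerns about the accuracy of the counting, because it fails in a few cases where it should not, but I will look into that. Overall, many thanks to everyone for the help.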