Question

Sorry, but I decided to delete the content because what I posted was incorrect.

Answer 1

Here is how I would change your style:

with open("z:/file.txt", "r") as file:  # text mode; Python 3 uses universal newlines by default
                                        # (the old "rU" flag was removed in Python 3.11)
    print(file.name)
    counter = 0
    for line in file:  # iterate over the file object itself, line by line
        if line.lower().startswith('dna'):  # look for your desired condition
            # process the data
            counter += 1
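
As a small self-contained sketch of the same idea, the loop can be wrapped in a function that accepts any iterable of lines, which makes it easy to try without a real file. The function name count_prefixed_lines and the sample lines are assumptions for illustration; they are not taken from the question.

```python
import io


def count_prefixed_lines(lines, prefix="dna"):
    """Count lines whose text starts with prefix, ignoring case."""
    counter = 0
    for line in lines:
        if line.lower().startswith(prefix.lower()):
            counter += 1
    return counter


# Hypothetical sample data standing in for the real input file
sample = io.StringIO("header\nDNA0000000001\tA\ndna0000000001\tB\nother\n")
result = count_prefixed_lines(sample)
print(result)
```

Passing an open file object instead of the StringIO works the same way, since both yield one line per iteration.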

Answer 2

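All variables are kept in memory. You want to hold on to the most recent match, compare each new line against it, and count while it keeps matching: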

import csv

prefix = 'DNA'

# newline='' is what the csv module expects for text-mode files in Python 3
with open('file.txt', newline='') as csvfile:
    # use csv to separate the fields, making it easier to deal with the
    # first value without hard-coding its size
    reader = csv.reader(csvfile, delimiter='\t')
    match = None
    count = 0
    is_good = False
    for row in reader:
        # matching rows (blank lines produce an empty row and count as non-matching)
        if row and row[0].startswith(prefix):

            if match is None:
                # first line with prefix
                match = row[0]

            if row[0] == match:
                # found a match, so increment
                count += 1
            else:
                # row prefix has changed
                if 96 <= count < 100:
                    # counted enough, so start counting the next
                    match = row[0]  # match on this now
                    count = 1  # this row is the first of the new run
                else:
                    # didn't count enough, so stop working through this file
                    break

        # non-matching rows
        else:
            if match is None:
                # ignore preceding lines in file
                continue
            else:
                # found non-matching line when expecting a match
                break
    else:
        # the loop finished without a break, so check the final run
        if 96 <= count < 100:
            # there was at least one successful run of lines
            is_good = True

if is_good:
    print('File was good')
else:
    print('File was bad')
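
To see what the csv step contributes, here is a minimal sketch that reads tab-separated text from memory and pulls out the first field of each row. The sample rows are made up, since the real file contents are not shown in the question.

```python
import csv
import io

# Hypothetical tab-separated sample; the real file layout is an assumption
sample = io.StringIO("DNA0000000001\t12\t0.5\nDNA0000000002\t7\t0.9\n")

# csv.reader splits each line on the delimiter, so row[0] is the ID
# regardless of how long it is
first_fields = [row[0] for row in csv.reader(sample, delimiter="\t")]
print(first_fields)
```

Because the reader does the splitting, the comparison in the loop never has to hard-code the width of the first column.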

Answer 3

Based on your description, the lines you are interested in match this regular expression:

^DNA[0-9]{10}

That is, I am assuming your xyz is actually ten digits.
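
A quick way to check that assumption is to run the pattern against a few lines. The sample lines below are invented for illustration and are not from the original file.

```python
import re

regex = re.compile(r"^DNA[0-9]{10}")

# Hypothetical sample lines
samples = [
    "DNA0123456789\tsome\tfields",  # 'DNA' plus ten digits: should match
    "DNA01234\tsome\tfields",       # too few digits: should not match
    "RNA0123456789\tsome\tfields",  # wrong prefix: should not match
]

matched = []
for line in samples:
    m = regex.match(line)
    # group(0) is the matched text, i.e. the 13-character ID
    matched.append(m.group(0) if m else None)

print(matched)
```

Only the first sample yields a match, and the matched text is the 13-character ID at the start of the line.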

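The strategy here is to match the 13-character string. If there is no match and we have not matched anything before, we simply keep going. Once we get a match, we save the string and increment the counter. As long as each line keeps matching both the regex and the saved string, we keep incrementing. As soon as we hit a different regex match, or no match at all, the run of identical matches is over. If the run is valid, we reset the count to zero and the last match to empty. If it is invalid, we exit.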

I hasten to add that the following is untested.

import re
import sys

# Input file with DNA lines to match:
infile = "z:/file.txt"

# This is the regex for the lines of interest:
regex = re.compile(r'^DNA[0-9]{10}')

# This will keep count of the number of matches in sequence:
n_seq = 0

# This is the previous match (if any):
lastmatch = ''

# Subroutine to check given sequence count and bail if bad:
def bail_on_bad_sequence(count, match):
    if 96 <= count < 100:
        return
    sys.stderr.write("Bad count (%d) for '%s'\n" % (count, match))
    sys.exit(1)


with open(infile) as file:
    for line in file:
        # Try to match the line to the regex:
        match = regex.match(line)

        if match:
            if match.group(0) == lastmatch:
                n_seq += 1
            else:
                # A different ID starts; validate the run that just ended, if any:
                if n_seq != 0:
                    bail_on_bad_sequence(n_seq, lastmatch)
                # The current line is the first of the new run:
                n_seq = 1
                lastmatch = match.group(0)
        else:
            if n_seq != 0:
                bail_on_bad_sequence(n_seq, lastmatch)
                n_seq = 0
                lastmatch = ''

# Validate the final run if the file ended while still counting:
if n_seq != 0:
    bail_on_bad_sequence(n_seq, lastmatch)
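
The strategy described in this answer can also be sketched with itertools.groupby, which groups consecutive lines that share the same key. This is an alternative formulation, not the answer's own code. The names line_key and runs_are_valid are hypothetical, and two choices are assumptions: the function returns False instead of exiting, and a file with no matching run at all is treated as bad.

```python
import re
from itertools import groupby

REGEX = re.compile(r"^DNA[0-9]{10}")


def line_key(line):
    """Return the 13-character ID for matching lines, or None otherwise."""
    m = REGEX.match(line)
    return m.group(0) if m else None


def runs_are_valid(lines, low=96, high=100):
    """True if there is at least one run and every run of identical
    IDs has a length n with low <= n < high."""
    seen_run = False
    for key, group in groupby(lines, key=line_key):
        if key is None:
            continue  # non-matching lines only separate runs
        seen_run = True
        count = sum(1 for _ in group)
        if not (low <= count < high):
            return False
    return seen_run


# Hypothetical in-memory data instead of a real file
good = ["junk\n"] + ["DNA0000000001\tx\n"] * 96 + ["DNA0000000002\tx\n"] * 97
bad = ["DNA0000000001\tx\n"] * 5
print(runs_are_valid(good), runs_are_valid(bad))
```

Since groupby only merges adjacent items, a non-matching line in the middle of a run splits it into two shorter runs, which mirrors the rule that a non-match ends the sequence.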

Answer 4

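Please ignore my last request to look at the code. I checked it myself and found that the problem was the formatting. It now appears to work as expected and analyzes all the files in the directory. Thanks again, Metthew. This helped a lot. I still have some concerns about the accuracy of the counts, because in a few cases it failed when it should not have, but I will look into that. Overall, many thanks to everyone for the help.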

How to temporarily save a variable value in memory and compare it in Python...

4 answers: