Filter a txt file by certain conditions in Python?

Time: 2019-02-14 15:33:25

Tags: python python-3.x

I have a txt file whose lines have the form subjectid_num_[dog/cat]_[option]:

ID1_0123_CAT_ANIMAL_3
ID1_0123_CAT_ANIMAL_GOOD_3
ID1_0123_ABC_3
ID2_1234_CAT_ANIMAL_3
ID2_1234_CAT_ANIMAL_GOOD_3
ID2_1234_DOG_ANIMAL_2
ID2_1234_DOG_ANIMAL_GOOD_0
ID2_1234_ABCD_3
ID3_4321_DOG_ANIMAL_1
ID3_4321_DOG_ANIMAL_GOOD_4
ID3_4321_DOG_3

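I want to filter the file to get the lines that satisfy a condition. For example, the code below should output the names that contain both CAT and GOOD, and for which no name containing DOG and GOOD exists. Names belong together when they share the same subject_id and the same number num. However, the code does not produce my expected output. How can I fix it?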

This is my code:

with open("./cat_dog.txt", 'r') as f:
    files_list = [line.rstrip('\n') for line in f]
file_filter = []
for i, cat in enumerate(files_list):
    if 'GOOD' in cat and 'CAT' in cat:
        subject_id = cat.split('_')[0]
        num_id = cat.split('_')[1]
        subject_num = subject_id + '_' + num_id
        for j, dog in enumerate(files_list):
                if subject_num in dog and 'GOOD' in dog:
                    if 'GOOD' in dog and 'DOG' in dog:
                        continue;
                    else:
                        file_filter.append(cat)

The current output is

ID1_0123_CAT_ANIMAL_GOOD_3
ID2_1234_CAT_ANIMAL_GOOD_3

while the expected output is

ID1_0123_CAT_ANIMAL_GOOD_3

1 Answer:

Answer 0: (score: 1)

Your code is wrong. Consider what happens in the inner loop when the outer loop is checking the line ID2_1234_CAT_ANIMAL_GOOD_3:

subject_id = cat.split('_')[0]            #ID2
num_id = cat.split('_')[1]                # 1234
subject_num = subject_id + '_' + num_id   #ID2_1234
for j, dog in enumerate(files_list):
        # when dog is the line ID2_1234_CAT_ANIMAL_GOOD_3
        if subject_num in dog and 'GOOD' in dog:   # this is true
            if 'GOOD' in dog and 'DOG' in dog:   # this is false
                continue;
            else:
                file_filter.append(cat)   # then it outputs it

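The problem is that every line containing both GOOD and CAT "matches" itself in the inner loop: it shares its own subject_num, contains GOOD, and is not a DOG line, so the else branch appends it regardless of whether a GOOD DOG line exists.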
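One way the nested-loop approach could be repaired is sketched below, assuming the intent is "keep a GOOD CAT line only if no GOOD DOG line shares its subject_id and num". The list literal stands in for the file contents, and `has_good_dog` is a name introduced here for illustration only.

```python
# Sample lines from the question, standing in for the contents of cat_dog.txt.
files_list = [
    "ID1_0123_CAT_ANIMAL_3",
    "ID1_0123_CAT_ANIMAL_GOOD_3",
    "ID1_0123_ABC_3",
    "ID2_1234_CAT_ANIMAL_3",
    "ID2_1234_CAT_ANIMAL_GOOD_3",
    "ID2_1234_DOG_ANIMAL_2",
    "ID2_1234_DOG_ANIMAL_GOOD_0",
    "ID2_1234_ABCD_3",
    "ID3_4321_DOG_ANIMAL_1",
    "ID3_4321_DOG_ANIMAL_GOOD_4",
    "ID3_4321_DOG_3",
]

file_filter = []
for cat in files_list:
    if 'GOOD' in cat and 'CAT' in cat:
        # Prefix such as "ID2_1234_", shared by every line of the same subject.
        subject_num = '_'.join(cat.split('_')[:2]) + '_'
        # Decide once per CAT line, after looking at ALL lines of the subject.
        has_good_dog = any(
            other.startswith(subject_num) and 'GOOD' in other and 'DOG' in other
            for other in files_list
        )
        if not has_good_dog:
            file_filter.append(cat)

print(file_filter)  # ['ID1_0123_CAT_ANIMAL_GOOD_3']
```

The key difference from the original loop is that the append happens after the whole scan, not inside it, so a line can no longer be added just because it matched itself. This is still O(n^2).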

IMHO, I would use itertools.groupby. Something like:

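from itertools import groupby

def subject_key(line):
    # Group lines by their first two fields: subject_id and num
    return line.split('_')[:2]

file_filter = []
# groupby only merges adjacent items, so sort by the same key first
for _, lines in groupby(sorted(files_list, key=subject_key), key=subject_key):
    good_lines = [line for line in lines if 'GOOD' in line]
    # Keep the group only if its single GOOD line is a CAT line
    if len(good_lines) == 1 and 'CAT' in good_lines[0]:
        file_filter.append(good_lines[0])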

This requires the whole content of the file in RAM, but at O(n log n) it is more efficient than the O(n^2) nested loops.
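If the sort is unwanted, the same grouping could also be done with a dictionary in a single pass. This is a swapped-in alternative, not what the answer proposes; the function name `filter_good_cats` is hypothetical, and the sketch assumes every GOOD line has at least two underscore-separated fields.

```python
from collections import defaultdict

def filter_good_cats(lines):
    """Return GOOD CAT lines whose (subject_id, num) group has no other GOOD line."""
    good_by_subject = defaultdict(list)
    for line in lines:
        if 'GOOD' in line:
            # Assumes the subjectid_num_... layout from the question.
            subject_id, num = line.split('_')[:2]
            good_by_subject[(subject_id, num)].append(line)
    # Same rule as the groupby version: exactly one GOOD line, and it is a CAT.
    return [
        good[0]
        for good in good_by_subject.values()
        if len(good) == 1 and 'CAT' in good[0]
    ]

sample = [
    "ID1_0123_CAT_ANIMAL_GOOD_3",
    "ID1_0123_ABC_3",
    "ID2_1234_CAT_ANIMAL_GOOD_3",
    "ID2_1234_DOG_ANIMAL_GOOD_0",
    "ID3_4321_DOG_ANIMAL_GOOD_4",
]
print(filter_good_cats(sample))  # ['ID1_0123_CAT_ANIMAL_GOOD_3']
```

The dictionary pass is O(n) on average and does not need the input sorted, at the cost of holding all GOOD lines in memory.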


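If you also have "classes" other than CAT and DOG, and you want to output all GOOD CAT lines unless the same subject_id also has a GOOD DOG line, you can modify the code above as follows:

is_good_cat = any('CAT' in line for line in good_lines)
is_good_dog = any('DOG' in line for line in good_lines)
if is_good_cat and not is_good_dog:
    file_filter.extend(line for line in good_lines if 'CAT' in line)

(You need to use extend and a loop, because we no longer know which line to write, so the GOOD lines have to be filtered.)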
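Combined with the grouping loop, the modified check might look like the self-contained sketch below. The names `subject_key` and `good_cats_without_good_dog` are introduced here for illustration, and the `BIRD` line is hypothetical sample data added only to show a third class.

```python
from itertools import groupby

def subject_key(line):
    # First two underscore-separated fields: (subject_id, num).
    return tuple(line.split('_')[:2])

def good_cats_without_good_dog(lines):
    result = []
    # groupby only merges adjacent items, so sort by the same key first.
    for _, group in groupby(sorted(lines, key=subject_key), key=subject_key):
        good_lines = [line for line in group if 'GOOD' in line]
        is_good_cat = any('CAT' in line for line in good_lines)
        is_good_dog = any('DOG' in line for line in good_lines)
        if is_good_cat and not is_good_dog:
            # Other GOOD classes may be present, so keep only the CAT lines.
            result.extend(line for line in good_lines if 'CAT' in line)
    return result

sample = [
    "ID1_0123_CAT_ANIMAL_GOOD_3",
    "ID1_0123_BIRD_ANIMAL_GOOD_1",  # hypothetical third class
    "ID2_1234_CAT_ANIMAL_GOOD_3",
    "ID2_1234_DOG_ANIMAL_GOOD_0",
    "ID3_4321_DOG_ANIMAL_GOOD_4",
]
print(good_cats_without_good_dog(sample))  # ['ID1_0123_CAT_ANIMAL_GOOD_3']
```

Here the ID1 group is kept even though it has two GOOD lines, because neither is a DOG line; the earlier `len(good_lines) == 1` rule would have dropped it.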