Configuring nested if loops to classify a dataset

Time: 2013-12-26 13:20:45

Tags: python

I have a file containing the following data:

line   EF1    1     F     Flu   5.7     3.221   9.332
line   A2     1     C     Car   3.2     5.22    1.22
line   A1     1     C     Car   3.11    4.21    2.13
line   HF1    1     H     Hyd   7.11    5.11    7.11
line   EE2    1     F     Flu   5.7     3.221   9.332
line   A2     2     C     Car   3.2     5.22    1.22
line   EF1    2     F     Flu   5.7     3.221   9.332
line   EE2    2     F     Flu   5.7     3.221   9.332
line   A1     2     C     Car   3.11    4.21    2.13
line   HE2    2     H     Hyd   7.11    5.11    7.11

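... and so on, for more than 1000 lines.

Here the third column is the chain number. I have created separate lists for the results; the desired output below uses an EF list, an H list and an ace list. What I want is this: if EF1 and HE1 both occur with the same chain number, write the EF data to the EF list and the HE data to the H list. If, on the other hand, only EF1 but not HE1 is present for a given chain number, write it to the ace list.

The desired output is: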

EF list: line  EF1   1  F   Flu   5.7   3.221   9.332
         line  EE2   2  F   Flu   5.7   3.221   9.332

H list: line   HF1   1  H   Hyd   7.11 5.11    7.11
        line   HE2   2  H   Hyd   7.11 5.11    7.11

ace list: line   EE2   1  F   Flu   5.7   3.221   9.332
          line   EF1   2  F   Flu   5.7   3.221   9.332

This is what I have tried so far:

inp = filename.read().strip().split('\n')
for line in map(str.split,inp):
    codeName = line[1]
    shortName = line[3]

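As a newbie I am really lost here: how can I build an if loop that performs this check?
Please give me some ideas on how to make progress! (I made a formatting error the first time. Corrected it!)

4 answers:

Answer 0: (score: 1)

Your code needs to look more like this: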

with open(filename) as inp:
    for line in inp:
        tokens = line.split()
        codeName = tokens[1]
        shortName = tokens[3]

You never opened the file at all, and map() is not really helping you either.
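
As a rough sketch of where this leads, the loop can be wrapped in a small generator that skips blank lines and yields the fields the question cares about. The name parse_rows and the use of io.StringIO as a stand-in for the real file are assumptions made purely for illustration.

```python
import io

SAMPLE = """\
line   EF1    1     F     Flu   5.7     3.221   9.332
line   A2     1     C     Car   3.2     5.22    1.22

line   HF1    1     H     Hyd   7.11    5.11    7.11
"""

def parse_rows(lines):
    """Yield (code_name, chain, short_name) for every non-blank line."""
    for line in lines:
        tokens = line.split()
        if not tokens:  # skip blank lines
            continue
        yield tokens[1], tokens[2], tokens[3]

# io.StringIO stands in for an opened file; with a real file this would be
# `with open(filename) as inp: rows = list(parse_rows(inp))`
rows = list(parse_rows(io.StringIO(SAMPLE)))
print(rows)
```

Iterating over the file object directly avoids reading the whole file into memory, which may matter once the input grows well beyond the 1000 lines mentioned in the question.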

Answer 1: (score: 0)

I don't think you really want to use a for loop there...

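with open(filename) as f:
    inp = f.read().strip().split('\n')
# drop the leading 'line' token so that index 0 is the code name
# and index 1 is the chain number
inp = [line.split()[1:] for line in inp]
# sort the rows numerically by chain number so that each chain's rows are contiguous
inp = sorted(inp, key=lambda row: int(row[1]))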

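Now you have a list of rows, each split into columns.

Next you want to divide them into blocks of rows that share the same second column, right?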

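cur_group = 1
groups_list = []
group = []
for line in inp:
    if int(line[1]) == cur_group:
        group.append(line)
    else:
        groups_list.append(group)
        group = [line]
        # take the chain number from the row, in case numbers are skipped
        cur_group = int(line[1])
# the last group is never closed by the loop, so store it here
if group:
    groups_list.append(group)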

Now you have a list of groups, each of which is a list of rows; your data looks like this:

groups_list = [[['EF1',   '1',     'F',     'Flu',   '5.7',     '3.221',   '9.332'],
                 ...],
               [['A2',    '2', ... ],
                 ...],
               ...
              ]
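
The same grouping could also be sketched with itertools.groupby from the standard library. This is an alternative to the manual loop, not what the answer itself uses; the shortened rows and the helper name chain_number are assumptions for illustration, with the chain number at index 1 as in the listing above.

```python
from itertools import groupby

# shortened rows: [code name, chain number, short name]
rows = [
    ["EF1", "1", "F"], ["A2", "1", "C"], ["HF1", "1", "H"],
    ["EE2", "1", "F"], ["A2", "2", "C"], ["EF1", "2", "F"],
]

def chain_number(row):
    """Return the chain number of a row as an integer."""
    return int(row[1])

# groupby only merges adjacent items, so sort by the same key first
rows_sorted = sorted(rows, key=chain_number)
groups_list = [list(group) for _, group in groupby(rows_sorted, key=chain_number)]

for group in groups_list:
    print([row[0] for row in group])
```

Because sorted is stable, rows within each chain keep their original file order.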

Now you can check what you really want to know, namely whether every EF in each group has a matching EH. I would create a helper function for this:

def find_match(line, group, EH_list, EF_list):
    """
    returns false if no match found, returns true and appends line and match to appropriate lists otherwise
    """
    for pmatch in group:
        if line[0].startswith('EH') and pmatch[0].startswith('EF') and pmatch[0][2]==line[0][2]: # Match case 1
            EH_list.append(line)
            EF_list.append(pmatch)
            return True
        elif line[0].startswith('EF') and pmatch[0].startswith('EH') and pmatch[0][2]==line[0][2]: # Match case 2
            EF_list.append(line)
            EH_list.append(pmatch)
            return True
    # report failure only after every candidate in the group has been checked
    return False

The rest is then simple and fairly straightforward:

# the three result lists must exist before the loop fills them
EH, EF, ace = [], [], []
for group in groups_list:
    for line in group:
        if line[0][0] == 'E' and not find_match(line, group, EH, EF):
            ace.append(line)

... and I think that should be it! I make no promises that this code will run right away, but it should at least give you a good place to start.
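
For comparison, here is a minimal sketch of the same group-then-match idea using the code names from the question's sample data. It assumes that the partner of a code beginning with 'E' is the same code with 'H' in its place (EF1 pairs with HF1, EE2 with HE2); that rule is inferred from the desired output and is not stated in the question. Rows are shortened to code name and chain number.

```python
from collections import defaultdict

rows = [
    ["EF1", "1"], ["A2", "1"], ["HF1", "1"], ["EE2", "1"],
    ["A2", "2"], ["EF1", "2"], ["EE2", "2"], ["HE2", "2"],
]

# bucket the rows by chain number
by_chain = defaultdict(list)
for row in rows:
    by_chain[row[1]].append(row)

EF, H, ace = [], [], []
for chain, group in by_chain.items():
    codes = {row[0] for row in group}  # every code name seen in this chain
    for row in group:
        code = row[0]
        if code.startswith("E"):
            # assumed rule: the partner of 'E' + rest is 'H' + rest
            if "H" + code[1:] in codes:
                EF.append(row)
            else:
                ace.append(row)
        elif code.startswith("H") and "E" + code[1:] in codes:
            H.append(row)
        # unpaired H rows are ignored: the question does not say where they go

print(EF, H, ace)
```

Building a set of code names per chain turns each partner check into a single membership test instead of a scan over the whole group.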

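Answer 2: (score: 0)

You said the order is random, so you cannot tell whether a pair of entries exists just by going through the file line by line. Instead, you need to remember every unpaired entry so that you can check for it once you have seen its partner.

So I would keep unmatched entries in a dictionary keyed by the Ex/Hx part and the chain number. If, for a given line, the paired entry is already in the unmatched dictionary, I remove it from there and add both entries to their correct lists. Otherwise I add the line itself to the dictionary.

All unmatched entries that have not been consumed by the end automatically become entries for ace.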

# initialize target lists and unmatched dictionary
E_list, H_list = [], []
unmatched = {}

with open('filename') as f:
    # loop over each line in the file
    for line in f:
        # split the line into parts separated by whitespace
        data = tuple(line.split())

        # `key` is the second column, `chain` the third
        key, chain = data[1], data[2]

        # `key` begins with an 'E'
        if key.startswith('E'):
            # The key of the paired value begins with an 'H' instead of 'E'
            pairKey = 'H' + key[1:]

            # The list for the current item will be `E_list`; the list for
            # the paired element will be `H_list`
            curList, pairList = E_list, H_list

        # `key` begins with an 'H'
        elif key.startswith('H'):
            # The key of the paired value begins with an 'E' instead of 'H'
            pairKey = 'E' + key[1:]

            # The list for the current item will be `H_list`; the list for
            # the paired element will be `E_list`
            curList, pairList = H_list, E_list

        # `key` starts with neither 'E' nor 'H', so skip this line
        else:
            continue

        # At this point we know that the current line has `key` as its
        # key, and `chain` as its chain. The element that should be paired
        # with it has `pairKey` as its key, and also `chain` as its chain.
        # If we have seen the paired element before, it will be in the
        # `unmatched` dictionary; if that’s the case, put the current element
        # into the list `curList`, and the paired element into the list
        # `pairList`.

        # Look up the paired element from the unmatched dictionary
        pair = unmatched.get((pairKey, chain), None)

        if pair:
            # If we found it, append the current and paired element to their
            # correct list …
            curList.append(data)
            pairList.append(pair)

            # … and remove the paired element from the unmatched set
            del unmatched[(pairKey, chain)]
        else:
            # Otherwise, if we didn’t find it, remember this item to be
            # paired with something later
            unmatched[(key, chain)] = data

# Finally, collect all elements that haven’t been matched yet, and
# put it into the `ace` list
ace = list(unmatched.values())

Used with your sample data, this produces:

>>> for l in E_list: print(l)
('line', 'EF1', '1', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
>>> for l in H_list: print(l)
('line', 'HF1', '1', 'H', 'Hyd', '7.11', '5.11', '7.11')
('line', 'HE2', '2', 'H', 'Hyd', '7.11', '5.11', '7.11')
>>> for l in ace: print(l)
('line', 'EF1', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '1', 'F', 'Flu', '5.7', '3.221', '9.332')

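One thing I am not sure about from your description is whether the "key" (EF/EH + number) and the chain number are enough to uniquely identify the pair. If they are not, you may want to change the kind of data I use as the dictionary key.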
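
If the same key and chain number can occur more than once, one possible variation is to keep a queue of waiting rows per key rather than a single row. This is only a sketch under that assumption and is not what the answer proposes; the shortened rows and the oldest-first pairing policy are choices made for illustration.

```python
from collections import defaultdict, deque

# shortened rows: (key, chain); note the duplicated ("EF1", "1")
rows = [("EF1", "1"), ("EF1", "1"), ("HF1", "1"), ("HE2", "2")]

E_list, H_list = [], []
unmatched = defaultdict(deque)  # (key, chain) -> rows still waiting for a partner

for data in rows:
    key, chain = data
    if key[0] not in "EH":
        continue
    is_e = key[0] == "E"
    pair_key = ("H" if is_e else "E") + key[1:]
    cur_list, pair_list = (E_list, H_list) if is_e else (H_list, E_list)

    waiting = unmatched[(pair_key, chain)]
    if waiting:
        cur_list.append(data)
        pair_list.append(waiting.popleft())  # pair with the oldest waiting row
    else:
        unmatched[(key, chain)].append(data)

# whatever is still waiting at the end was never paired
ace = [row for queue in unmatched.values() for row in queue]
print(E_list, H_list, ace)
```

With a plain dictionary the second ("EF1", "1") row would overwrite the first; with a queue the extra row stays unmatched and ends up in ace.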

Answer 3: (score: 0)

First pass through the file:

I think the key part is how to identify which entries form a pair. I used the following snippet:

splitline = line.strip().split()
identifier = "".join(splitline[1:3])[1:]
# You could also write the following, if that makes it more clear: 
identifier = splitline[1][1:] + splitline[2]

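Basically, "".join(splitline[1:3])[1:] builds the identifier string F11 from HF1 1: it drops the first character (the E or H), so what remains is the part we test for occurring once or twice.

In the example, EF1 with chain 1 and HF1 with chain 1 both produce the identifier F11. The first time one of them appears, the code sets categories['F11'] = 1. When its partner is found, it sets categories['F11'] = 2.

With this we build a dictionary that records, for each identifier, whether it occurs once or twice.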
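
The counting step can be sketched with collections.Counter. Unlike the dictionary described above, a Counter keeps counting past 2, so this is a variation rather than the exact method; the helper name identifier and the three sample lines are assumptions for illustration.

```python
from collections import Counter

lines = [
    "line   EF1    1     F     Flu   5.7     3.221   9.332",
    "line   HF1    1     H     Hyd   7.11    5.11    7.11",
    "line   EE2    1     F     Flu   5.7     3.221   9.332",
]

def identifier(line):
    """Build the pairing identifier, e.g. 'F11' for EF1 or HF1 in chain 1."""
    splitline = line.split()
    # drop the leading E/H of the code name, then append the chain number
    return splitline[1][1:] + splitline[2]

categories = Counter(identifier(line) for line in lines)
print(categories["F11"], categories["E21"])
```

Here F11 is seen twice (EF1 and HF1 in chain 1) and E21 once (EE2 in chain 1), which is the once-or-twice distinction the second pass relies on.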

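Once the categories dictionary is built, we can go through the file a second time:

If the identifier's value in categories is 1, we know the line belongs in ace; if the value is 2, we know the entry should be written to F or H.

Because we use dictionaries, this solution will be very fast; if you want to maintain the order in the lists, let me know and I can update the answer accordingly.

So here is the code: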

infile = "overflow.txt"
result = {"F" : [], "H" : [], "ace" : []}
with open(infile) as f:
    # First pass through to build the dictionary of identifiers
    categories = {}
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if identifier not in categories:
            categories[identifier] = 1
        else:
            # If the identifier is already there, the value becomes 2.    
            categories[identifier] = 2
    # To go for a second pass through to create the lists
    f.seek(0)
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if categories[identifier] == 1:  # meaning the stem just occurs once
            if splitline[3] != "C":
                result["ace"].append(line.strip())
        else:
            result[splitline[3]].append(line.strip())

You can now access these groups via result["H"], result["F"] and result["ace"].

Here is the code that prints the results:

for kind in result:
    print("\n", kind, "list:", "\n", "\n ".join(result[kind]))

Output:

H list: 
  line   HF1    1     H     Hyd   7.11    5.11    7.11
  line   HE2    2     H     Hyd   7.11    5.11    7.11

F list: 
  line   EF1    1     F     Flu   5.7     3.221   9.332
  line   EE2    2     F     Flu   5.7     3.221   9.332

ace list: 
  line   EE2    1     F     Flu   5.7     3.221   9.332
  line   EF1    2     F     Flu   5.7     3.221   9.332