Question

I have a file containing the following data:

line   EF1    1     F     Flu   5.7     3.221   9.332
line   A2     1     C     Car   3.2     5.22    1.22
line   A1     1     C     Car   3.11    4.21    2.13
line   HF1    1     H     Hyd   7.11    5.11    7.11
line   EE2    1     F     Flu   5.7     3.221   9.332
line   A2     2     C     Car   3.2     5.22    1.22
line   EF1    2     F     Flu   5.7     3.221   9.332
line   EE2    2     F     Flu   5.7     3.221   9.332
line   A1     2     C     Car   3.11    4.21    2.13
line   HE2    2     H     Hyd   7.11    5.11    7.11

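...... more than 1000 lines.

Here the 3rd column is the chain number. I have created separate lists named EF, EE, H and ace. What I want is: if EF1 and HF1 both occur with the same chain number, write the EF data to the 'EF list' and the H data to the 'H list'. If, on the other hand, only EF1 is present for a chain number and its H counterpart is not, write it to the 'ace list'.

The desired output is: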

EF list: line  EF1   1  F   Flu   5.7   3.221   9.332
         line  EE2   2  F   Flu   5.7   3.221   9.332

H list: line   HF1   1  H   Hyd   7.11 5.11    7.11
        line   HE2   2  H   Hyd   7.11 5.11    7.11

ace list: line   EE2   1  F   Flu   5.7   3.221   9.332
          line   EF1   2  F   Flu   5.7   3.221   9.332

So far I have tried:

inp = filename.read().strip().split('\n')
for line in map(str.split,inp):
    codeName = line[1]
    shortName = line[3]

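As a newbie I am really lost here: how can I build an if loop to do this check?
Please give me some ideas on how to make progress with this!! (I got the formatting wrong the first time. Corrected it!)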

Answer 1

Your code needs to look more like this:

with open(filename) as inp:
    for line in inp:
        tokens = line.split()
        codeName = tokens[1]
        shortName = tokens[3]

You never opened the file at all, and map() isn't really helping you either.
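To see what that loop yields per line, here is a minimal sketch that applies the same split to two of the sample lines. The `io.StringIO` object only stands in for the opened file, and the names `rows` and `chain` are made up for this illustration.

```python
import io

# Two lines from the question's sample data; StringIO stands in for the opened file
sample = io.StringIO(
    "line   EF1    1     F     Flu   5.7     3.221   9.332\n"
    "line   HF1    1     H     Hyd   7.11    5.11    7.11\n"
)

rows = []
for line in sample:
    tokens = line.split()      # split on any run of whitespace
    code_name = tokens[1]      # second column, e.g. 'EF1'
    chain = tokens[2]          # third column: chain number, still a string
    short_name = tokens[3]     # fourth column, e.g. 'F'
    rows.append((code_name, chain, short_name))

print(rows)
```

Note that the chain number comes back as a string, so convert it with `int()` if you need numeric comparison or sorting.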

Answer 2

I think you don't really want to use a for loop there...

from operator import itemgetter

inp = filename.read().strip().split('\n')
inp = [line.split() for line in inp]
# sorts the input lines by the second column, so that groups appear together in order
inp = sorted(inp, key=itemgetter(1))

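Now you have a list of lines, each split into columns.

Next you want to split them into blocks of lines that share the same second column, right?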

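cur_group = 1
groups_list = []
group = []
for line in inp:
    if int(line[1]) == cur_group:
        group.append(line)
    else:
        # a new chain number starts: store the finished group
        groups_list.append(group)
        group = [line]
        cur_group = int(line[1])
# the last group is still pending when the loop ends
groups_list.append(group)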

Now you have a list of groups, where each group is a list of lines; your data looks like this:

groups_list = [[['EF1',   '1',     'F',     'Flu',   '5.7',     '3.221',   '9.332'],
                 ...],
               [['A2',    '2', ... ],
                 ...],
               ...
              ]
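The same grouping can also be sketched with `itertools.groupby` from the standard library. This is an alternative to the manual loop, not what the answer itself uses; the shortened sample rows (without the leading 'line' column, as this answer assumes) are made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

# Shortened sample rows: code name, chain number, short name
inp = [
    ['EF1', '1', 'F'],
    ['A2', '2', 'C'],
    ['HF1', '1', 'H'],
    ['EE2', '2', 'F'],
]

# groupby only merges adjacent items, so sort by the chain column first
inp = sorted(inp, key=lambda row: int(row[1]))
groups_list = [list(rows) for _, rows in groupby(inp, key=itemgetter(1))]

for group in groups_list:
    print(group)
```

Because `sorted` is stable, lines within one chain keep their original relative order.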

Now you can look at what you really want to know, namely whether each EF in each group has a matching EH. I'll create a helper function for that:

def find_match(line, group, EH_list, EF_list):
    """
    Returns False if no match is found; otherwise appends line and its match
    to the appropriate lists and returns True.
    """
    for pmatch in group:
        if line[0].startswith('EH') and pmatch[0].startswith('EF') and pmatch[0][2]==line[0][2]: # Match case 1
            EH_list.append(line)
            EF_list.append(pmatch)
            return True
        elif line[0].startswith('EF') and pmatch[0].startswith('EH') and pmatch[0][2]==line[0][2]: # Match case 2
            EF_list.append(line)
            EH_list.append(pmatch)
            return True
    # only give up after every candidate in the group has been checked
    return False

The rest is then simple and fairly straightforward:

# target lists for the results
EH, EF, ace = [], [], []

for group in groups_list:
    for line in group:
        if line[0][0] == 'E' and not find_match(line, group, EH, EF):
            ace.append(line)

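...and I think that should be it! I won't make any promises that this code will run right away, but it should at least give you a good place to start.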

Answer 3

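You said the order is random, so you cannot tell whether a pair of entries exists just by walking through the file line by line. Instead, you need to remember all unpaired entries so that you can check for them once you see the other half.

So I would keep the unmatched entries in a dictionary, with the Ex / Hx part and the chain number as the key. If, for a given line, the paired entry is in the unmatched dictionary, I can remove it from there and add both to the correct lists. Otherwise I add the line itself to the dictionary.

All unmatched entries that have not been consumed by the end automatically become the entries for ace.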

# initialize target lists and unmatched dictionary
E_list, H_list = [], []
unmatched = {}

with open('filename') as f:
    # loop over each line in the file
    for line in f:
        # split the line into parts separated by whitespace
        data = tuple(line.split())

        # `key` is the second column, `chain` the third
        key, chain = data[1], data[2]

        # `key` begins with an 'E'
        if key.startswith('E'):
            # The key of the paired value begins with an 'H' instead of 'E'
            pairKey = 'H' + key[1:]

            # The list for the current item will be `E_list`; the list for
            # the paired element will be `H_list`
            curList, pairList = E_list, H_list

        # `key` begins with an 'H'
        elif key.startswith('H'):
            # The key of the paired value begins with an 'E' instead of 'H'
            pairKey = 'E' + key[1:]

            # The list for the current item will be `H_list`; the list for
            # the paired element will be `E_list`
            curList, pairList = H_list, E_list

        # `key` starts with neither 'E' nor 'H', so skip this line
        else:
            continue

        # At this point we know that the current line has `key` as its
        # key, and `chain` as its chain. The element that should be paired
        # with it has `pairKey` as its key, and also `chain` as its chain.
        # If we have seen the paired element before, it will be in the
        # `unmatched` dictionary; if that’s the case, put the current element
        # into the list `curList`, and the paired element into the list
        # `pairList`.

        # Look up the paired element from the unmatched dictionary
        pair = unmatched.get((pairKey, chain), None)

        if pair:
            # If we found it, append the current and paired element to their
            # correct list …
            curList.append(data)
            pairList.append(pair)

            # … and remove the paired element from the unmatched dictionary
            del unmatched[(pairKey, chain)]
        else:
            # Otherwise, if we didn’t find it, remember this item to be
            # paired with something later
            unmatched[(key, chain)] = data

# Finally, collect all elements that haven’t been matched yet, and
# put them into the `ace` list
ace = list(unmatched.values())

Used with your sample data, this produces:

>>> for l in E_list: print(l)
('line', 'EF1', '1', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
>>> for l in H_list: print(l)
('line', 'HF1', '1', 'H', 'Hyd', '7.11', '5.11', '7.11')
('line', 'HE2', '2', 'H', 'Hyd', '7.11', '5.11', '7.11')
>>> for l in ace: print(l)
('line', 'EF1', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '1', 'F', 'Flu', '5.7', '3.221', '9.332')

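One thing I'm not sure about from your description is whether the "key" (EF / EH + number) and the chain number are enough to identify a pair uniquely. If that is not the case, you may want to change the data type I use as the dictionary key.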
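If the same (key, chain) combination can occur more than once, one possible adjustment is to keep a list of waiting rows per key instead of a single row. The sketch below is only one way to do that; the function name `pair_entries`, the oldest-first pairing rule, and the shortened sample rows are assumptions made for this illustration.

```python
from collections import defaultdict

def pair_entries(rows):
    """Pair E*/H* rows by (key suffix, chain); each row is a tuple of columns."""
    e_list, h_list = [], []
    unmatched = defaultdict(list)   # (key, chain) -> rows still waiting for a partner

    for data in rows:
        key, chain = data[1], data[2]
        if key.startswith('E'):
            pair_key, cur_list, pair_list = 'H' + key[1:], e_list, h_list
        elif key.startswith('H'):
            pair_key, cur_list, pair_list = 'E' + key[1:], h_list, e_list
        else:
            continue                # neither 'E' nor 'H': skip the row

        waiting = unmatched[(pair_key, chain)]
        if waiting:
            cur_list.append(data)
            pair_list.append(waiting.pop(0))   # pair with the oldest waiting partner
        else:
            unmatched[(key, chain)].append(data)

    # Everything still waiting at the end goes to ace
    ace = [row for waiting in unmatched.values() for row in waiting]
    return e_list, h_list, ace

# Hypothetical rows: EF1 appears twice on chain 1, HF1 only once
rows = [
    ('line', 'EF1', '1', 'first'),
    ('line', 'EF1', '1', 'second'),
    ('line', 'HF1', '1', 'partner'),
]
e_list, h_list, ace = pair_entries(rows)
print(e_list, h_list, ace)
```

Here the first EF1 row is paired with the HF1 row, and the second EF1 row ends up in ace because no partner is left for it.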

Answer 4

First pass through the file:

I think the key part is how to identify which lines form a pair. I used the following snippet:

splitline = line.strip().split()
identifier = "".join(splitline[1:3])[1:]
# You could also write the following, if that makes it more clear: 
identifier = splitline[1][1:] + splitline[2]

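Basically, what "".join(splitline[1:3])[1:] does is build the identifier string F11 from HF1 1 (it drops the first character). That identifier is what we need to test: does it occur only with an "E" entry, or with an "H" entry as well (or vice versa)?

In the example, EF1 with chain 1 and HF1 with chain 1 both produce the identifier F11. The first time one of them appears, it sets categories['F11'] = 1. When its partner is found, it sets categories['F11'] = 2.

Using this, we build a dictionary that records for each identifier whether it appears once or twice.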
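As a small worked example of that identifier scheme, the sketch below counts identifiers for three of the sample lines with `collections.Counter`. Using `Counter` is a substitution made for this illustration (the answer's own code uses a plain dictionary capped at 2), and the helper name `identifier` is made up.

```python
from collections import Counter

lines = [
    "line   EF1    1     F     Flu   5.7     3.221   9.332",
    "line   HF1    1     H     Hyd   7.11    5.11    7.11",
    "line   EE2    1     F     Flu   5.7     3.221   9.332",
]

def identifier(line):
    parts = line.split()
    # drop the first character of the code name, then append the chain number
    return parts[1][1:] + parts[2]

# Count how often each identifier occurs across the lines
categories = Counter(identifier(line) for line in lines)
print(categories)
```

EF1 and HF1 on chain 1 share the identifier F11, so it is counted twice, while EE2 on chain 1 yields E21, which occurs only once and would therefore go to ace.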

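Once the categories dictionary is built, we can go through the file a second time:

If the identifier's value in categories is 1, we know the line belongs in ace; if the value is 2, we know the entry should be written to F or H.

Since we use dictionaries, this solution will be very fast; if you want to maintain the order in the lists, let me know and I can update accordingly.

So here is the code: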

infile = "overflow.txt"
result = {"F" : [], "H" : [], "ace" : []}
with open(infile) as f:
    # First pass through to build the dictionary of identifiers
    categories = {}
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if identifier not in categories:
            categories[identifier] = 1
        else:
            # If the identifier is already there, the value becomes 2.    
            categories[identifier] = 2
    # To go for a second pass through to create the lists
    f.seek(0)
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if categories[identifier] == 1:  # meaning the stem just occurs once
            if splitline[3] != "C":
                result["ace"].append(line.strip())
        else:
            result[splitline[3]].append(line.strip())

You can now access the groups via result["H"], result["F"] and result["ace"].

Here is the code to print the results:

# `name` avoids shadowing the built-in `type`
for name in result:
    print("\n", name, "list:", "\n", "\n ".join(result[name]))

H list: 
  line   HF1    1     H     Hyd   7.11    5.11    7.11
  line   HE2    2     H     Hyd   7.11    5.11    7.11

F list: 
  line   EF1    1     F     Flu   5.7     3.221   9.332
  line   EE2    2     F     Flu   5.7     3.221   9.332

ace list: 
  line   EE2    1     F     Flu   5.7     3.221   9.332
  line   EF1    2     F     Flu   5.7     3.221   9.332

Configuring nested if loops to categorize a data set
