Combining nearly identical lists of dictionary values for each key

Date: 2020-04-17 16:09:29

Tags: python json

I have an input file like this:

structureId chainId resolution  uniprotAcc  structureMolecularWeight
101M    A   2.07    P02185  18112.8
102L    A   1.74    P00720  18926.61
103D    A                   7502.93
103D    B                   7502.93
103L    A   1.9     P00720  19092.72
103M    A   2.07    P02185  18093.78
104L    A   2.8     P00720  37541.04
104L    B   2.8     P00720  37541.04
104M    A   1.71    P02185  18030.63
104M    A   3.1     P09323  2312.2

I want the output to look like this:

structureId chainId resolution  uniprotAcc  structureMolecularWeight

101M    A   2.07    P02185  18112.8
102L    A   1.74    P00720  18926.61
103D    A                   7502.93
103D    B                   7502.93
103L    A   1.9     P00720  19092.72
103M    A   2.07    P02185  18093.78
104L    A,B 2.8     P00720  37541.04
104M    A   1.71    P02185  18030.63
104M    A   3.1     P09323  2312.2

That is, if rows share the same 'uniprotAcc' and the same 'structureId', combine them into one row.

I wrote this code:

import sys

set_of_ids = list(set([line.strip().split('\t')[0] for line in open(sys.argv[1])]))

master_dict = {}
for line in open(sys.argv[1]):
    split_line = line.strip().split('\t')
    if split_line[0] not in master_dict:
        master_dict[split_line[0]] = [split_line[1:]]
    else:
        master_dict[split_line[0]].append(split_line[1:])

print(master_dict)

This groups the data so that each key is a structureId and each value is the list of rows belonging to that structureId:

{'structureId': [['chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']], '101M': [['A', '2.07', 'P02185', '18112.8']], '102L': [['A', '1.74', 'P00720', '18926.61']], '103D': [['A', '', '', '7502.93'], ['B', '', '', '7502.93']], '103L': [['A', '1.9', 'P00720', '19092.72']], '103M': [['A', '2.07', 'P02185', '18093.78']], '104L': [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']], '104M': [['A', '1.71', 'P02185', '18030.63'], ['A', '3.1', 'P09323', '2312.2']]}

I am stuck on one small thing. I know how to iterate over the dictionary:

for k in master_dict:
    for each_list in master_dict[k]:

I am just stuck on the next line: how do I say "merge the lists that are identical except for their first element (index 0)"?

That is, turn:

104L    A   2.8     P00720  37541.04
104L    B   2.8     P00720  37541.04

into:

104L    A,B   2.8     P00720  37541.04

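Basically (and I may be making this sound more complicated than it is): for the rows in my table, if the only difference between rows with the same structureId and uniprotAcc is the chainId column, combine the chainId values.

Edit 1: in reply to the question in the answer below.

For example, this is the data: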

structureId chainId resolution  uniprotAcc  structureMolecularWeight
6YC3    A   2.0 N0DKS8  181807.39
6YC3    B   2.0 N0DKS8  181807.39
6YC3    C   2.0 N0DKS8  181807.39
6YC3    D   2.0 N0DKS8  181807.39
6YC3    E   2.0 N0DKS8  181807.39
6YC4    A   2.6 N0DKS8  174142.86
6YC4    B   2.6 N0DKS8  174142.86
6YC4    C   2.6 N0DKS8  174142.86
6YC4    D   2.6 N0DKS8  174142.86
6YC4    E   2.6 N0DKS8  174142.86

So the output should be:

6YC3 A,B,C,D,E 2.0 N0DKS8 181807.39
6YC4 A,B,C,D,E 2.6 N0DKS8 174142.86

The output of the code below is:

['6YC3', 'B,B,C,D,E,A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']

Edit 2: To avoid the problem above, I created a column that combines the UniProt accession and the structureId:

structureId chainId resolution  uniprotAcc  structureMolecularWeight    newcode
6YC3    A   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    B   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    C   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    D   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    E   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC4    A   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    B   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    C   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    D   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    E   2.6 N0DKS8  174142.86   N0DKS8_6YC4

Then I just replaced this line of code:

idx_uniprotAcc = headers.index("uniprotAcc") #to...
idx_uniprotAcc = headers.index("newcode")

When I run exactly the same code as below, with only that one line changed, the output is:

['6YC3', 'B,B,C,D,E', '2.0', 'N0DKS8', '181807.39', 'N0DKS8_6YC3']
['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86', 'N0DKS8_6YC4']

Why does the first line return "B,B,C,D,E" rather than "A,B,C,D,E"? I think it has something to do with iterating over data[1:]?

2 answers:

Answer 0: (score: 1)

You can use the built-in zip to concatenate the rows item by item; map can then be used for further processing.

For the given input:

item = [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']]

output = list(map(lambda t: t[0] if t[0] == t[1] else t[0] + "," + t[1], zip(*item)))

The result is:

['A,B', '2.8', 'P00720', '37541.04']

Note: the lambda passed to map assumes that at most 2 rows are being concatenated. You can easily change it to handle n rows.
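As a minimal sketch of the n-row case mentioned in the note: the helper name merge_rows is hypothetical, and it assumes that every column whose values differ should be joined with commas (in the question that is only the chainId column). dict.fromkeys drops repeated values while keeping first-seen order.

```python
def merge_rows(rows):
    """Merge any number of rows column by column (hypothetical helper)."""
    merged = []
    for column in zip(*rows):
        # Keep each distinct value once, in the order it first appears
        unique = list(dict.fromkeys(column))
        merged.append(",".join(unique))
    return merged


rows = [
    ['A', '2.8', 'P00720', '37541.04'],
    ['B', '2.8', 'P00720', '37541.04'],
    ['C', '2.8', 'P00720', '37541.04'],
]
print(merge_rows(rows))  # ['A,B,C', '2.8', 'P00720', '37541.04']
```

Because it joins every differing column, this only gives sensible results when the rows passed in already agree on everything except chainId.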

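Answer 1: (score: 1)

Let's try the following approach:

  1. Open the file and read all lines. For this we can use readlines(), which returns all lines as a list. (For more details, this tutorial explains how to use it.)

    1. On each line, we apply strip to clean up the string.
    2. Now that we have a line, we want to extract the value of each column. To do that, we split the string on whitespace. Because the number of spaces between values may vary, we use the re.split method from the re module, which splits on a regular expression. The pattern used is \s+, where \s stands for a whitespace character and + means one or more. Note that empty columns disappear with this split, so a row with missing values ends up with fewer fields than the header.

    The first step can be summarized in the following two lines: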

with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
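  2. Take headers = data[0], the first line, as the header row.
  3. Iterate over all lines. We use enumerate to get the current index (and from it the previous line).

    • If the current line and the previous line have the same uniprotAcc (and the same structureId): we update the last output line by appending the current chainId
    • Otherwise: we add the current line to the output

Full code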
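To illustrate the splitting step, here is a small sketch comparing re.split on \s+ with a plain tab split. The sample row is taken from the question; the tab-separated variant is an assumption based on the question's own code, which splits on '\t'.

```python
import re

# Row with two empty columns, as it appears in the question
line = "103D    A                   7502.93\n"
fields = re.split(r'\s+', line.strip())
print(fields)  # ['103D', 'A', '7502.93']

# Assumption: the real file is tab-separated, so empty columns can be kept
tab_line = "103D\tA\t\t\t7502.93\n"
tab_fields = tab_line.rstrip("\n").split("\t")
print(tab_fields)  # ['103D', 'A', '', '', '7502.93']
```

With \s+ the row shrinks to three fields, which is why the full code compares len(line) with len(headers) before merging.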

import re

# Read file
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
print(data)


# Select headers
headers = data[0]
# Get index columns if not known
idx_uniprotAcc = headers.index("uniprotAcc")
idx_structureId = headers.index("structureId")
idx_chainId = headers.index("chainId")
# Remove header line
data = data[1:]

# In any case, we can add the header and first line to the output
out = [headers, data[0]]
print(out)
# Iterate over the lines starting at the second one
for i, line in enumerate(data[1:]):
    # Get the previous line (i starts at 0 while the loop starts at the second line, so data[i] is the previous one)
    prev_line = data[i]

    # print("prev:    ", prev_line)
    # print("current: ", line)

    # Check that both lines match and that both have all their values.
    # You can add as many column checks here as you want
    # (here I added one on "structureId", as this seems to match the expected output,
    #  but to be sure, it may be better to check all the columns)
    if len(line) == len(headers) and \
            len(prev_line) == len(headers) and \
            line[idx_uniprotAcc] == prev_line[idx_uniprotAcc] and line[idx_structureId] == prev_line[idx_structureId]:
        # Merge current with previous output line
        out[-1][idx_chainId] += ",{}".format(line[idx_chainId])
    else:
        # Line is added
        out.append(line)

for x in out:
    print(x)
# ['101M', 'A', '2.07', 'P02185', '18112.8']
# ['102L', 'A', '1.74', 'P00720', '18926.61']
# ['103D', 'A', '7502.93']
# ['103D', 'B', '7502.93']
# ['103L', 'A', '1.9', 'P00720', '19092.72']
# ['103M', 'A', '2.07', 'P02185', '18093.78']
# ['104L', 'A,B', '2.8', 'P00720', '37541.04']
# ['104M', 'A', '1.71', 'P02185', '18030.63']
# ['104M', 'A', '3.1', 'P09323', '2312.2']
# ['6YC3', 'A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
# ['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86']

# Export in text file
# with open('output.txt', 'w') as f:
#     f.writelines("%s\n" % "  ".join(x) for x in out)
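The approach above only compares each line with the one before it, so it relies on matching rows being adjacent. As an alternative sketch (not the method used in this answer), the rows can be grouped in a dictionary keyed on every column except chainId, which is close to the master_dict idea in the question. The function name merge_chains and the SAMPLE data are hypothetical, and the input is assumed to be tab-separated with the same number of columns on every row. Rows with an empty uniprotAcc are left unmerged, to match the expected output in the question.

```python
import csv
import io


def merge_chains(lines):
    """Group rows that differ only in chainId (hypothetical helper)."""
    reader = csv.reader(lines, delimiter="\t")
    headers = next(reader)
    idx_chain = headers.index("chainId")
    idx_acc = headers.index("uniprotAcc")

    groups = {}  # key -> list of chain IDs, in first-seen order
    for n, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        rest = tuple(row[:idx_chain] + row[idx_chain + 1:])
        # Rows without a uniprotAcc get a unique key so they are never merged
        key = (rest, None if row[idx_acc] else n)
        groups.setdefault(key, []).append(row[idx_chain])

    merged = [headers]
    for (rest, _), chains in groups.items():
        row = list(rest)
        row.insert(idx_chain, ",".join(chains))
        merged.append(row)
    return merged


# Assumed tab-separated sample built from rows in the question
SAMPLE = (
    "structureId\tchainId\tresolution\tuniprotAcc\tstructureMolecularWeight\n"
    "103D\tA\t\t\t7502.93\n"
    "103D\tB\t\t\t7502.93\n"
    "104L\tA\t2.8\tP00720\t37541.04\n"
    "104L\tB\t2.8\tP00720\t37541.04\n"
    "104M\tA\t1.71\tP02185\t18030.63\n"
    "104M\tA\t3.1\tP09323\t2312.2\n"
)

for merged_row in merge_chains(io.StringIO(SAMPLE)):
    print("\t".join(merged_row))
```

Since regular dictionaries keep insertion order, the merged rows come out in the order their group first appears in the file.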

Hope that helps!