I have an input file that looks like this:
structureId chainId resolution uniprotAcc structureMolecularWeight
101M A 2.07 P02185 18112.8
102L A 1.74 P00720 18926.61
103D A 7502.93
103D B 7502.93
103L A 1.9 P00720 19092.72
103M A 2.07 P02185 18093.78
104L A 2.8 P00720 37541.04
104L B 2.8 P00720 37541.04
104M A 1.71 P02185 18030.63
104M A 3.1 P09323 2312.2
I want the output to look like this:
structureId chainId resolution uniprotAcc structureMolecularWeight
101M A 2.07 P02185 18112.8
102L A 1.74 P00720 18926.61
103D A 7502.93
103D B 7502.93
103L A 1.9 P00720 19092.72
103M A 2.07 P02185 18093.78
104L A,B 2.8 P00720 37541.04
104M A 1.71 P02185 18030.63
104M A 3.1 P09323 2312.2
That is, if consecutive rows share the same 'uniprotAcc' and the same 'structureId', combine them into one row.
I wrote this code:
import sys

# Distinct structure IDs from the first column (not used further below)
set_of_ids = list(set([line.strip().split('\t')[0] for line in open(sys.argv[1])]))

# Map each structureId to the list of its rows (all remaining columns)
master_dict = {}
for line in open(sys.argv[1]):
    split_line = line.strip().split('\t')
    if split_line[0] not in master_dict:
        master_dict[split_line[0]] = [split_line[1:]]
    else:
        master_dict[split_line[0]].append(split_line[1:])
print(master_dict)
This groups the data so that the key is the structureId and the value is the list of rows belonging to that structureId:
{'structureId': [['chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']], '101M': [['A', '2.07', 'P02185', '18112.8']], '102L': [['A', '1.74', 'P00720', '18926.61']], '103D': [['A', '', '', '7502.93'], ['B', '', '', '7502.93']], '103L': [['A', '1.9', 'P00720', '19092.72']], '103M': [['A', '2.07', 'P02185', '18093.78']], '104L': [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']], '104M': [['A', '1.71', 'P02185', '18030.63'], ['A', '3.1', 'P09323', '2312.2']]}
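I'm stuck on one small thing. I know how to iterate over the dictionary:
for k in master_dict:
    for each_list in master_dict[k]:
but I don't know how to write the next line, which should say 'merge the lists that are identical except for their first element (index 0)'.
That is, turn: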
104L A 2.8 P00720 37541.04
104L B 2.8 P00720 37541.04
into:
104L A,B 2.8 P00720 37541.04
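Basically (and I may be making this sound more complicated than it is): for the rows in my table, if the only difference between rows with the same structureId and uniprotAcc is the chainId column, combine the chainId values.
Edit 1: in reply to the answer below:
For example, this is the data: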
structureId chainId resolution uniprotAcc structureMolecularWeight
6YC3 A 2.0 N0DKS8 181807.39
6YC3 B 2.0 N0DKS8 181807.39
6YC3 C 2.0 N0DKS8 181807.39
6YC3 D 2.0 N0DKS8 181807.39
6YC3 E 2.0 N0DKS8 181807.39
6YC4 A 2.6 N0DKS8 174142.86
6YC4 B 2.6 N0DKS8 174142.86
6YC4 C 2.6 N0DKS8 174142.86
6YC4 D 2.6 N0DKS8 174142.86
6YC4 E 2.6 N0DKS8 174142.86
So the output should be:
6YC3 A,B,C,D,E 2.0 N0DKS8 181807.39
6YC4 A,B,C,D,E 2.6 N0DKS8 174142.86
The output of the code below is:
['6YC3', 'B,B,C,D,E,A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
Edit 2: To avoid the problem above, I created a column that combines the UniProt accession and the structureId:
structureId chainId resolution uniprotAcc structureMolecularWeight newcode
6YC3 A 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 B 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 C 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 D 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 E 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC4 A 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 B 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 C 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 D 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 E 2.6 N0DKS8 174142.86 N0DKS8_6YC4
Then I just replaced this line of code:
idx_uniprotAcc = headers.index("uniprotAcc") #to...
idx_uniprotAcc = headers.index("newcode")
When I ran exactly the same code as below, with only that one line changed, the output was:
['6YC3', 'B,B,C,D,E', '2.0', 'N0DKS8', '181807.39', 'N0DKS8_6YC3']
['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86', 'N0DKS8_6YC4']
Why does the first row return 'B,B,C,D,E' instead of 'A,B,C,D,E'? I think it has something to do with iterating over data[1:]?
Answer 0: (score: 1)
You can use the built-in zip to concatenate the rows item by item. map can then be used for further processing.
For the given input:
item = [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']]
output = list(map(lambda t: t[0] if t[0] == t[1] else t[0] + "," + t[1], zip(*item)))
The result is:
['A,B', '2.8', 'P00720', '37541.04']
Note: the lambda passed to map assumes that at most 2 rows are being concatenated. You can easily change it to handle n rows.
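The n-row case mentioned in the note could be sketched roughly as follows. This is only an illustration, not part of the original answer: `merge_rows` is a hypothetical helper name, and it assumes the caller passes in only rows that are already known to belong together, because it joins the values of every column that differs, not just the chain ID.

```python
def merge_rows(rows):
    """Merge rows column by column; differing values are joined with commas."""
    merged = []
    for column in zip(*rows):
        # dict.fromkeys drops duplicates while keeping first-seen order
        unique = list(dict.fromkeys(column))
        merged.append(",".join(unique))
    return merged


rows = [
    ['A', '2.0', 'N0DKS8', '181807.39'],
    ['B', '2.0', 'N0DKS8', '181807.39'],
    ['C', '2.0', 'N0DKS8', '181807.39'],
]
print(merge_rows(rows))
# ['A,B,C', '2.0', 'N0DKS8', '181807.39']
```

With exactly two rows this gives the same result as the lambda version above.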
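Answer 1: (score: 1)
Let's try the following approach:
Open the file and read all the lines. For this we can use readlines(), which returns all the lines as a list. (For more details, this tutorial explains how to use it.)
Clean each line with strip, which removes the trailing newline.
Split each line using a regex from the re module. The re.split method splits a string on a regular expression. The pattern used is \s+, where \s matches a whitespace character and + means one or more.
These first steps can be summarized in the following two lines: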
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
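Take the first row as the headers: headers = data[0]
Iterate over all the remaining rows. We use enumerate to get the current index (and from it the previous row).
If the current row has the same uniprotAcc (and structureId) as the previous row, we merge the two by appending the current chainId to the previous output row.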
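One consequence of splitting on \s+ is worth seeing on a row with empty fields. The sketch below assumes the input file is tab-separated, as the question's own `split('\t')` call and dictionary output suggest. The sample line is reconstructed from the 103D row under that assumption.

```python
import re

# Assumed raw form of the 103D row: resolution and uniprotAcc are empty
line = "103D\tA\t\t\t7502.93\n"

by_whitespace = re.split(r'\s+', line.strip())   # runs of tabs collapse into one separator
by_tab = line.rstrip('\n').split('\t')           # empty fields are kept as ''

print(by_whitespace)  # ['103D', 'A', '7502.93']
print(by_tab)         # ['103D', 'A', '', '', '7502.93']
```

Because the whitespace split yields only three fields for such rows, a length check against the header row is enough to keep them from being merged.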
Full code
import re

# Read file
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
print(data)

# Select headers
headers = data[0]

# Get column indexes if not known
idx_uniprotAcc = headers.index("uniprotAcc")
idx_structureId = headers.index("structureId")
idx_chainId = headers.index("chainId")

# Remove header line
data = data[1:]

# In any case, we can add the header and first line to the output
out = [headers, data[0]]
print(out)

# Iterate over the lines starting at the second one
for i, line in enumerate(data[1:]):
    # Get previous line (i starts at 0, but data[1:] starts at the second line)
    prev_line = data[i]
    # print("prev: ", prev_line)
    # print("current: ", line)

    # Check that the lines match and that both have all the values.
    # Here you can add as many column checks as you want
    # (I only check "uniprotAcc" and "structureId", as this seems to match the output,
    # but to be sure, it may be better to check all the columns)
    if (len(line) == len(headers)
            and len(prev_line) == len(headers)
            and line[idx_uniprotAcc] == prev_line[idx_uniprotAcc]
            and line[idx_structureId] == prev_line[idx_structureId]):
        # Merge current line with previous output line
        out[-1][idx_chainId] += ",{}".format(line[idx_chainId])
    else:
        # Line is added
        out.append(line)

for x in out:
    print(x)
# ['101M', 'A', '2.07', 'P02185', '18112.8']
# ['102L', 'A', '1.74', 'P00720', '18926.61']
# ['103D', 'A', '7502.93']
# ['103D', 'B', '7502.93']
# ['103L', 'A', '1.9', 'P00720', '19092.72']
# ['103M', 'A', '2.07', 'P02185', '18093.78']
# ['104L', 'A,B', '2.8', 'P00720', '37541.04']
# ['104M', 'A', '1.71', 'P02185', '18030.63']
# ['104M', 'A', '3.1', 'P09323', '2312.2']
# ['6YC3', 'A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
# ['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86']

# Export in text file
# with open('output.txt', 'w') as f:
#     f.writelines("%s\n" % " ".join(x) for x in out)
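The code above only compares each row with the one directly before it, so it relies on matching rows being adjacent. If that cannot be guaranteed, a dictionary keyed on every column except chainId is one possible alternative. The following is a minimal sketch under stated assumptions, not the method used above: `merge_chains` is a hypothetical helper, the input is assumed to be tab-separated with every row having the same number of columns, and rows with an empty uniprotAcc are left unmerged because the desired output in the question keeps the two 103D rows separate.

```python
import csv


def merge_chains(lines):
    """Join chainId values of rows that agree on every other column."""
    reader = csv.reader(lines, delimiter='\t')
    headers = next(reader)
    idx_chain = headers.index("chainId")
    idx_acc = headers.index("uniprotAcc")
    out = [headers]
    seen = {}  # key (row without chainId) -> merged output row
    for row in reader:
        key = tuple(row[:idx_chain] + row[idx_chain + 1:])
        if row[idx_acc] and key in seen:
            # Same structure and accession already seen: append this chain ID
            seen[key][idx_chain] += "," + row[idx_chain]
        else:
            merged = list(row)  # copy so the input row is not mutated
            out.append(merged)
            if row[idx_acc]:
                seen[key] = merged
    return out


sample = (
    "structureId\tchainId\tresolution\tuniprotAcc\tstructureMolecularWeight\n"
    "103D\tA\t\t\t7502.93\n"
    "103D\tB\t\t\t7502.93\n"
    "104L\tA\t2.8\tP00720\t37541.04\n"
    "104L\tB\t2.8\tP00720\t37541.04\n"
    "104M\tA\t1.71\tP02185\t18030.63\n"
    "104M\tA\t3.1\tP09323\t2312.2\n"
)
result = merge_chains(sample.splitlines())
for row in result:
    print("\t".join(row))
```

Using the whole remainder of the row as the key means two rows are merged only when resolution and molecular weight also agree, which is stricter than the check on uniprotAcc and structureId alone.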
Hope that helps!