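Python: join elements that have the same index

Time: 2015-03-23 08:38:36

Tags: python

I have a file with a few thousand lines, and I want to populate a dictionary line by line, using the gene as the key. If the gene is already present, only the "rest" of the line should be appended to its values. I want the values joined by index, so that elements from the same column end up next to each other. This is where I am so far.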

listfile = {}

with open("Desktop/testfile", "r") as f:
    for lines in f:
        lines=lines.strip()
        gene=lines.split()[0]
        rest = lines.split()[1:]


        if gene not in listfile:
            listfile[gene] = rest
            #print gene, rest
        else:
            for items in rest:

                listfile[gene].append(items)    


for items in listfile.items():
    print items

Input:

ACCA    39072094753 D   12
ACCA    983954875454    G   11
ACCA    098540980985    F   22

Output:

('ACCA', ['39072094753', 'D', '12', '983954875454', 'G', '11', '098540980985', 'F', '22'])

Expected output:

('ACCA', ['39072094753','983954875454','098540980985', 'D','G','F', '12','11','22'])

5 answers:

Answer 0 (score: 1)

Here is a general solution that works for any number of columns in the input file:

import collections
import itertools

genes_info = collections.defaultdict(list)

with open("testfile") as genes_file:
    for line in genes_file:
        fields = line.split()
        genes_info[fields[0]].append(fields[1:])  # Stores each row information

# Conversion of the row-first gene information into column-first information:
for gene_info in genes_info.itervalues():
    gene_info[:] = itertools.chain(*zip(*gene_info))

print genes_info

This gives

{'ACCA': ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22']}

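(If you need a plain dictionary instead of the mostly equivalent defaultdict, you can add genes_info = dict(genes_info) at the end.)

If you want to keep the values of each column together, use the simpler gene_info[:] = zip(*gene_info). This gives: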

{'ACCA': [('39072094753', '983954875454', '098540980985'), ('D', 'G', 'F'), ('12', '11', '22')]}

In effect, zip() basically converts rows into columns.
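The transposition can be seen in isolation with the three rows from the question. This is a minimal sketch written for Python 3 (where zip() returns an iterator, hence the list() call); the names rows, columns and flat are only illustrative.

```python
# Rows as they would be collected for one gene (values from the question's input).
rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G", "11"],
    ["098540980985", "F", "22"],
]

# zip(*rows) groups the i-th element of every row, i.e. it yields the columns.
columns = list(zip(*rows))

# Flatten the columns into a single list, one column after another.
flat = [value for column in columns for value in column]

print(columns)
print(flat)
```

The list comprehension does the same job as itertools.chain(*...) in the answer's code; either form works.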

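PS: line.split() with no arguments splits on runs of whitespace and never returns empty strings, so the trailing newline is dropped automatically. That is why I simplified the original line.strip().split(): the strip() is unnecessary there.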
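A small check of that behaviour, as a sketch in Python 3. The tab-separated line is a hypothetical variant added only for contrast; it shows that the shortcut applies to the no-argument form of split(), not to split() with an explicit separator.

```python
line = "ACCA    39072094753 D   12\n"

# With no argument, split() discards leading/trailing whitespace, newline included.
fields_plain = line.split()
fields_stripped = line.strip().split()
print(fields_plain == fields_stripped)

# Hypothetical tab-separated variant: with an explicit separator the newline stays
# attached to the last field, so strip() (or rstrip("\n")) is needed again.
tab_line = "ACCA\t39072094753\tD\t12\n"
fields_tab = tab_line.split("\t")
print(fields_tab)
```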

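Answer 1 (score: 1)

I am guessing that every line has the same number of whitespace-separated values. If not, the longest row determines the length used for zipping (izip_longest pads the shorter rows with None).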

from __future__ import print_function 
import itertools
listfile = {}

with open("Desktop/testfile", "r") as f:
    for line in f:
        line = line.strip().split()
        gene = line[0]
        rest = line[1:]

        if gene not in listfile:
            listfile[gene] = []
        listfile[gene].append(rest)

for gene, rows in listfile.items():
    # izip_longest pads shorter rows with None (Python 3 name: itertools.zip_longest)
    print(gene, list(itertools.chain(*itertools.izip_longest(*rows))))
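To illustrate what happens when rows differ in length, here is a short sketch for Python 3, where the function is called itertools.zip_longest. The second row is deliberately shortened and is hypothetical; it does not appear in the question's data.

```python
from itertools import chain, zip_longest

rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G"],  # hypothetical short row, one value missing
]

# zip_longest keeps going until the longest row is exhausted, filling gaps with None.
padded = list(chain(*zip_longest(*rows)))

# Plain zip stops at the shortest row, silently dropping the extra value.
truncated = list(chain(*zip(*rows)))

print(padded)
print(truncated)
```

Whether padding or truncation is preferable depends on the data; zip_longest also accepts a fillvalue argument if None is not a convenient placeholder.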

Answer 2 (score: 0)

Here is how you can do it.

openedFile = open('data.txt', 'r')

largeNumber = []
letter = []
smallNumber = []

for line in openedFile:
    splittedContent = line.split()
    largeNumber.append(splittedContent[1])
    letter.append(splittedContent[2])
    smallNumber.append(splittedContent[3])

print ('ACCA', largeNumber + letter + smallNumber)

Output:

('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])

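Answer 3 (score: -1)

If you only need a comma-separated string as output, you can do this:

for gene, values in listfile.items():
    # join() needs strings, so unpack each (gene, values) pair first
    print ",".join([gene] + values)

I think that keeping the attributes in a list would be useful for further processing.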

Answer 4 (score: -1)

This looks like a good use case for defaultdict:
from collections import defaultdict
listfile = defaultdict(list)

with open("Desktop/testfile", "r") as f:
    all_lines = (l.split() for l in f)
    for line in all_lines:
        first = line[0]
        rest = line[1:]
        listfile[first].extend(rest)
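Note that extend() accumulates the values in row order, which matches the output the question already produces rather than the expected one. One possible way to get the column grouping, sketched here for Python 3, is to append whole rows and transpose them afterwards. The io.StringIO object is only a stand-in for the real file, and rows_by_gene and grouped are illustrative names.

```python
from collections import defaultdict
from io import StringIO

# Stand-in for the input file, using the sample rows from the question.
sample = StringIO(
    "ACCA    39072094753 D   12\n"
    "ACCA    983954875454    G   11\n"
    "ACCA    098540980985    F   22\n"
)

rows_by_gene = defaultdict(list)
for line in sample:
    fields = line.split()
    if not fields:  # skip blank lines
        continue
    # append() keeps each row as its own list; extend() would flatten in row order.
    rows_by_gene[fields[0]].append(fields[1:])

# Transpose each gene's rows with zip(*rows), then flatten column by column.
grouped = {
    gene: [value for column in zip(*rows) for value in column]
    for gene, rows in rows_by_gene.items()
}

print(grouped)
```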