Question

I have a file like this:

A2ML1,ENST00000541459
A2ML1,ENST00000545692
A2ML1,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005
A4GALT,ENST00000381278

I want to group the lines like this:

A2ML1,ENST00000541459,ENST00000545692,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005,ENST00000381278

This is my code in Python, which just leaves the file the same as the original XD:

import sys

with open('gene_list.csv', 'r') as file_open:
    iterfile = iter(file_open)
    for line in iterfile:
        l = line.split(",")
        select = l[0]
        linext = iterfile.next()
        linext2 = linext.split(",")
        if select == linext2[0]:
            sys.stdout.write(select + ',' + linext2[1])
            next(file_open)
        else:
            sys.stdout.write(select + ',' + l[1])

I know this should be easy, but I'm stuck on it. I would really appreciate your help. Thanks!

Answer 1

Hope this helps :)

import csv
import collections

#Read in the data as a dictionary
with open('gene_list.csv', 'r') as fd:

    reader = csv.reader(fd)

    #If you have headers in the CSV file you want to skip
    #next(reader, None)

    #This dict will have key:value, value=list type
    unique_first_col = collections.defaultdict(list)
    for row in reader:
        unique_first_col[row[0]].append(row[1])

with open('output.csv', 'w') as fd:

    #Sorted dictionary
    sorted_d = collections.OrderedDict(sorted(unique_first_col.items()))
    for k, v in sorted_d.items():
        fd.write("%s,%s\n" % (k, ','.join(v)))

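Notes:

See the documentation for collections.defaultdict.
For information on CSV processing, see this question.
Before using a value as a dictionary key, you may want to apply simple string "preprocessing" such as strip(), because trailing whitespace can cause the same key to be entered as a new key.
For information on sorting dictionaries, see the documentation.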
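
To illustrate the note about strip(), here is a minimal sketch of how a trailing space splits one gene into two keys. The sample data and the helper name group_rows are made up for this illustration and are not part of the original answer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: the second "A4GALT" has a trailing space before the comma.
sample = "A4GALT,ENST00000249005\nA4GALT ,ENST00000381278\n"


def group_rows(text, clean):
    """Group second-column values by first-column key, stripping if clean is True."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        key, value = row[0], row[1]
        if clean:
            key, value = key.strip(), value.strip()
        groups[key].append(value)
    return dict(groups)


raw_groups = group_rows(sample, clean=False)
clean_groups = group_rows(sample, clean=True)

print(len(raw_groups))    # "A4GALT" and "A4GALT " are counted as different keys
print(len(clean_groups))  # after strip() they collapse into one key
```

Without stripping, the two rows end up under separate keys; with stripping they are merged under a single "A4GALT" entry.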

Answer 2

If you'd like to try pandas, you can do it like this:

import pandas as pd
df = pd.read_csv("gene_list.csv", header=None)
df.columns = ["First", "Second"]
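# Join each group's values into one comma-separated string.
# (Passing a dict to agg() on a single column raises an error in current pandas.)
merged = df.groupby("First")["Second"].agg(lambda x: ",".join(x.astype(str)))
print(merged)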

Answer 3

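A simple solution is to use the first value as a dictionary key. Using defaultdict is not strictly necessary, but it makes it easier to build the list of second-column values.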

from collections import defaultdict

merged = defaultdict(list)

with open('gene_list.csv', 'r') as f:
    for raw_line in f:
        line = raw_line.strip()
        first, second = line.split(',')
        merged[first].append(second)

for key, values in merged.items():
    print(key + ',' + ','.join(values))

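If you have to assume that lines in your original file can hold more than two comma-separated values, you will need to adjust this script slightly.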
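
One possible adjustment, as a rough sketch rather than the only way to do it: unpack the first field as the key and treat everything after it as values. This assumes Python 3 (for the starred unpacking); the function name merge_lines and the sample lines are hypothetical.

```python
from collections import defaultdict


def merge_lines(lines):
    """Merge all values after the first field under that first field."""
    merged = defaultdict(list)
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore empty lines
        # The first field is the key; any number of fields may follow it.
        first, *rest = line.split(',')
        merged[first].extend(rest)
    return merged


# Hypothetical input where one line already carries two transcript IDs.
sample_lines = [
    "A2ML1,ENST00000541459,ENST00000545692\n",
    "A2ML1,ENST00000545850\n",
    "A3GALT2,ENST00000442999\n",
]

for key, values in merge_lines(sample_lines).items():
    print(key + ',' + ','.join(values))
```

To run it on the real file, you could pass the open file object instead of the list, for example merge_lines(f) inside the with block.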

Concatenate values that share the same first column into one line
