I have a file like this:
A2ML1,ENST00000541459
A2ML1,ENST00000545692
A2ML1,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005
A4GALT,ENST00000381278
I want to group the lines like this:
A2ML1,ENST00000541459,ENST00000545692,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005,ENST00000381278
Here is my Python code, which just leaves the file as it was in the original XD:
import sys
with open('gene_list.csv', 'r') as file_open:
    iterfile = iter(file_open)
    for line in iterfile:
        l = line.split(",")
        select = l[0]
        linext = iterfile.next()
        linext2 = linext.split(",")
        if select == linext2[0]:
            sys.stdout.write(select + ',' + linext2[1])
            next(file_open)
        else:
            sys.stdout.write(select + ',' + l[1])
I know this is easy, but I'm stuck on it. I'd really appreciate your help. Thanks!
Answer 0: (score: 2)
Hope this helps :)
import csv
import collections

# Read in the data as a dictionary
with open('gene_list.csv', 'r') as fd:
    reader = csv.reader(fd)
    # If you have headers in the CSV file you want to skip
    # next(reader, None)
    # This dict maps each first-column value to a list of second-column values
    unique_first_col = collections.defaultdict(list)
    for row in reader:
        unique_first_col[row[0]].append(row[1])

with open('output.csv', 'w') as fd:
    # Sort the groups by key before writing them out
    sorted_d = collections.OrderedDict(sorted(unique_first_col.items()))
    for k, v in sorted_d.items():
        # No space after the comma, to match the requested output format
        fd.write("%s,%s\n" % (k, ','.join(v)))
Notes:
collections.defaultdict
Read up on strip()
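To make the two notes above concrete, here is a minimal sketch. The sample line and the variable names are made up for illustration. csv.reader already drops the line ending for you, so strip() matters mainly when you split lines by hand.

```python
from collections import defaultdict

# Illustrative sample line, as it would come from iterating over a file
raw = "A4GALT,ENST00000381278\n"

without_strip = raw.split(",")       # the last field keeps its trailing '\n'
with_strip = raw.strip().split(",")  # the trailing newline is removed first

# defaultdict(list) creates an empty list the first time a key is used,
# so no explicit "is this key already present?" check is needed
groups = defaultdict(list)
groups[with_strip[0]].append(with_strip[1])

print(without_strip)
print(with_strip)
print(dict(groups))
```

With a plain dict, the append line would raise a KeyError the first time a gene name is seen.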
Answer 1: (score: 0)
If you'd like to try pandas, you can do it like this:
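import pandas as pd
df = pd.read_csv("gene_list.csv", header=None)
df.columns = ["First", "Second"]
# Join every "Second" value that shares the same "First" value.
# Passing a renaming dict to agg() on a single column fails in recent pandas versions.
grouped = df.groupby("First")["Second"].agg(lambda x: ",".join(x.astype(str)))
print(grouped)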
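Answer 2: (score: 0)
A simple solution is to use the first value as a dictionary key. Using defaultdict isn't strictly necessary, but it makes it easier to build the list of secondary values.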
from collections import defaultdict

merged = defaultdict(list)
with open('gene_list.csv', 'r') as f:
    for raw_line in f:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        first, second = line.split(',')
        merged[first].append(second)

# Iterate over items() to get (key, values) pairs, not just the keys
for key, values in merged.items():
    print(key + ',' + ','.join(values))
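If you have to assume that a line in your original file can hold more than two comma-separated values, you will need to adjust this script slightly.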
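One possible adjustment, as a rough sketch: keep the first field as the key and collect every remaining field on the line. The function name merge_rows and the sample lines are only for illustration, and the starred unpacking assumes Python 3.

```python
from collections import defaultdict

def merge_rows(lines):
    """Group comma-separated lines by their first field (illustrative helper)."""
    merged = defaultdict(list)
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # The first field is the key; rest holds zero or more further fields
        first, *rest = line.split(',')
        merged[first].extend(rest)
    return merged

# Made-up sample in which one line already carries two transcript IDs
sample = [
    "A2ML1,ENST00000541459,ENST00000545692\n",
    "A2ML1,ENST00000545850\n",
    "A3GALT2,ENST00000442999\n",
]
for key, values in merge_rows(sample).items():
    print(','.join([key] + values))
```

Since merge_rows accepts any iterable of lines, an open file object can be passed in place of the sample list.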