I have a file like this:
bob NULL 0 A A G G G G G
tom NULL 0 A A A A A A A
sara NULL 0 C C C C T T T
jane NULL 0 failed failed failed failed failed failed failed
I need to count A/C, C/A, A/T, T/A, A/G, G/A, C/G, G/C, C/T, T/C, T/G and G/T, and remove all homozygous lines, so the output I want looks like this:
bob NULL 0 A A G G G G G G/A
sara NULL 0 C C C C T T T C/T
Here is my attempt:
fileA = open("myfile.txt",'r')
import re
#fileA.next()
lines=fileA.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    snp_name=new_list[0]
    allele=new_list[3:]
    failed_count = allele.count('failed')
    A_count = allele.count('A')
    C_count = allele.count('C')
    G_count = allele.count('G')
    T_count = allele.count('T')
    #A/C OR C/A count
    if A_count > 0:
        if C_count > 0:
            if A_count > C_count:
                new_list.append('A/C')
            else:
                new_list.append('C/A')
        #A/T OR T/A count
        if T_count > 0:
            if A_count > T_count:
                new_list.append('A/T')
            else:
                new_list.append('T/A')
        #A/G OR G/A count
        if G_count > 0:
            if A_count > G_count:
                new_list.append('A/G')
            else:
                new_list.append('G/A')
    #C/G OR G/C count
    if C_count > 0:
        if G_count > 0:
            if C_count > G_count:
                new_list.append('C/G')
            else:
                new_list.append('G/C')
        #C/T OR T/C count
        if T_count > 0:
            if C_count > T_count:
                new_list.append('C/T')
            else:
                new_list.append('T/C')
    #T/G OR G/T count
    if T_count > 0:
        if G_count > 0:
            if T_count > G_count:
                new_list.append('T/G')
            else:
                new_list.append('G/T')
    r=open('allele_counts.txt', 'a')
    x='\t'.join(new_list)
    x=x+'\n'
    r.writelines(x)
fileA.close()
r.close()
Can you suggest how I could improve the code and remove all the homozygous lines?
Answer 0: (score: 0)
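The problem probably comes from how the file was written: you need to make sure the columns are separated by actual tabs.
Once myfile.txt is edited that way, your code works fine.
The problem is that the list you were counting 'A' in previously looked like this: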
['bob NULL 0 A A G G G G G']
You need it to be:
['bob', 'NULL', '0', 'A', 'A', 'G', 'G', 'G', 'G', 'G']
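To illustrate the difference, here is a small sketch using the first sample line from the question (the variable names are only for illustration). Splitting a space-separated line on tabs leaves it in one piece, whereas `str.split()` with no arguments splits on any run of whitespace, so it copes with both tabs and spaces:

```python
import re

line = "bob NULL 0 A A G G G G G\n"  # space-separated, as in the question's sample

# Splitting on tabs finds no separator, so the whole line stays one element.
by_tabs = re.split(r'\t+', line.strip())

# str.split() with no arguments splits on any run of whitespace (spaces or tabs).
by_whitespace = line.split()

print(by_tabs)
print(by_whitespace)
```

If the input file cannot be changed, replacing the `re.split` call with `line.split()` is one possible alternative to re-saving the file with tabs.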
Answer 1: (score: 0)
Maybe this refactoring can help:
import re
from collections import Counter
from operator import itemgetter
# Use with so that you don't forget to close the file in the end. Also, it is
# more pythonic
with open("myfile.txt",'r') as fileA:
    with open('allele_counts.txt', 'a') as fileB:
        # The file object is in itself an iterator, so you can iterate over it
        for line in fileA:
            new_list = re.split(r'\t+',line.strip())
            allele = new_list[3:]
            failed_count = allele.count('failed')
            # Use python's counter object to do the counting
            counts = Counter(allele)
            # Get the top two most common strings. This returns a list of
            # tuples with item and its count
            top_two = counts.most_common(2)
            # We only need the item, so pluck that out from the list
            classification = '/'.join(map(itemgetter(0), top_two))
            # Add our classification to the new output list
            new_list.append(classification)
            # write to file, ending each record with a newline
            fileB.write('\t'.join(new_list) + '\n')
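The snippet above tags every line, including the homozygous and all-failed ones. One way to drop those is sketched below, under the assumption that a line counts as heterozygous only when at least two distinct bases from A, C, G, T appear and that 'failed' calls are simply ignored. The helper name `classify` and the in-memory `rows` list are hypothetical, used here in place of the file handling:

```python
from collections import Counter

BASES = {'A', 'C', 'G', 'T'}

def classify(alleles):
    """Return 'X/Y' (most common base first), or None for homozygous/failed lines."""
    # Assumption: anything that is not a base (e.g. 'failed') is ignored.
    counts = Counter(a for a in alleles if a in BASES)
    if len(counts) < 2:
        return None
    (first, _), (second, _) = counts.most_common(2)
    return first + '/' + second

rows = [
    "bob NULL 0 A A G G G G G",
    "tom NULL 0 A A A A A A A",
    "sara NULL 0 C C C C T T T",
    "jane NULL 0 failed failed failed failed failed failed failed",
]

kept = []
for row in rows:
    fields = row.split()
    tag = classify(fields[3:])
    if tag is not None:  # skip homozygous and all-failed lines
        kept.append('\t'.join(fields + [tag]))

print('\n'.join(kept))
```

When two bases have equal counts, `most_common` keeps them in the order they were first seen, which may differ from the tie-breaking in the question's code.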
Answer 2: (score: 0)
Another approach is to use a pandas DataFrame:
import pandas as pd
df = pd.read_table('myfile.txt', header=None, sep=" ", skipinitialspace=True)
select = ['A', 'G', 'C', 'T', 'failed']
# select out all the heterozygous rows
for elem in select:
    df = df[(df.iloc[:,3:10] != elem).any(axis=1)]
# reset the index since we removed rows
df = df.reset_index(drop=True)
df[10] = '' # column 10 will hold the tags
# add the tag to the line in the form A/B where count('A') > count('B') for a row
for i in range(df.shape[0]):
    tags = df.iloc[i, 3:10].unique().tolist()
    if sum(df.iloc[i, 3:10] == tags[0]) < sum(df.iloc[i, 3:10] == tags[1]):
        tags.reverse()
    df.iloc[i, 10] = '/'.join(tags)
df.to_csv('allele_counts.txt', sep=" ", header=False, index=False, na_rep='NULL')
When I run it with myfile.txt, I get the following allele_counts.txt:
bob NULL 0 A A G G G G G G/A
sara NULL 0 C C C C T T T C/T