Question

I have a file like this:

bob     NULL    0   A   A   G   G   G   G   G
tom     NULL    0   A   A   A   A   A   A   A
sara    NULL    0   C   C   C   C   T   T   T
jane    NULL    0   failed  failed  failed  failed  failed  failed  failed

I need to count A/C, C/A, A/T, T/A, A/G, G/A, C/G, G/C, C/T, T/C, T/G and G/T, and remove all homozygous lines, so my desired output looks like this:

bob     NULL    0   A   A   G   G   G   G   G  G/A
sara    NULL    0   C   C   C   C   T   T   T  C/T

Here is my attempt:

fileA = open("myfile.txt",'r')
import re
#fileA.next()
lines=fileA.readlines()
for line in lines:
  new_list=re.split(r'\t+',line.strip())
  snp_name=new_list[0]
  allele=new_list[3:]
  failed_count = allele.count('failed')
  A_count = allele.count('A')
  C_count = allele.count('C')
  G_count = allele.count('G')
  T_count = allele.count('T')
 #A/C OR C/A count
  if A_count > 0:
    if C_count > 0:
      if A_count > C_count:
        new_list.append('A/C')
      else:
        new_list.append('C/A')
#A/T OR T/A count
    if T_count > 0:
      if A_count > T_count:
        new_list.append('A/T')
      else:
        new_list.append('T/A')
#A/G OR G/A count
    if G_count > 0:
      if A_count > G_count:
        new_list.append('A/G')
      else:
        new_list.append('G/A')
#C/G OR G/C count
  if C_count > 0:
    if G_count > 0:
      if C_count > G_count:
        new_list.append('C/G')
      else:
        new_list.append('G/C')
#C/T OR T/C count
    if T_count > 0:
      if C_count > T_count:
        new_list.append('C/T')
      else:
        new_list.append('T/C')
#T/G OR G/T count
  if T_count > 0:
    if G_count > 0:
      if T_count > G_count:
        new_list.append('T/G')
      else:
        new_list.append('G/T')
  r=open('allele_counts.txt', 'a')
  x='\t'.join(new_list)
  x=x+'\n'
  r.writelines(x)
fileA.close()
r.close()

Can you suggest how I can improve the code and remove all homozygous lines?

Answer 1

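The problem probably comes from how your file is written: you need to make sure the columns are separated by actual tabs. When I edited myfile.txt to use tabs, your code worked. The problem is that the list you are counting 'A' in looks like this: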

['bob     NULL    0   A   A   G   G   G   G   G']

You need it to be:

['bob', 'NULL', '0', 'A', 'A', 'G', 'G', 'G', 'G', 'G']
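
If re-saving the file with tabs is not convenient, one possible alternative is to split on any run of whitespace by calling `str.split()` with no argument instead of `re.split(r'\t+', ...)`. The sketch below uses the first sample line from the question; `split_fields` is a hypothetical helper name, not something from the original code.

```python
import re

def split_fields(line):
    """Split a record on any run of whitespace (spaces or tabs)."""
    return line.split()

# First sample row from the question, separated by spaces rather than tabs
line = "bob     NULL    0   A   A   G   G   G   G   G\n"

# Splitting on tabs only leaves the space-separated line in one piece
tab_only = re.split(r'\t+', line.strip())

# Splitting on arbitrary whitespace yields one field per column
fields = split_fields(line)
alleles = fields[3:]
```

With this change the counts are taken over individual columns, so `alleles.count('A')` sees the two 'A' calls instead of comparing against the whole line.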

Answer 2

Maybe this refactoring can help:

import re
from collections import Counter
from operator import itemgetter

# Use with so that you don't forget to close the file in the end. Also, it is
# more pythonic
with open("myfile.txt",'r') as fileA:
    with open('allele_counts.txt', 'a') as fileB:
        # The file object is in itself an iterator, so you can iterate over it
        for line in fileA:
            new_list = re.split(r'\t+', line.strip())
            # Ignore 'failed' calls so they are never treated as an allele
            allele = [a for a in new_list[3:] if a != 'failed']

            # Use python's counter object to do the counting
            counts = Counter(allele)
            # Skip homozygous (or all-failed) lines: they have fewer than
            # two distinct alleles
            if len(counts) < 2:
                continue
            # Get the top two most common strings. This returns a list of
            # tuples with item and its count
            top_two = counts.most_common(2)
            # We only need the item, so pluck that out from the list
            classification = '/'.join(map(itemgetter(0), top_two))

            # Add our classification to the new output list
            new_list.append(classification)
            # write to file, ending each record with a newline
            fileB.write('\t'.join(new_list) + '\n')
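
The Counter idea can also be packaged as a small function that is easy to check against the sample rows from the question. This is only a sketch under a few assumptions: `classify` is a hypothetical name, 'failed' calls are ignored, rows with fewer than two distinct alleles are treated as homozygous, and ties are left to the order in which `most_common` returns them, which the question does not specify.

```python
from collections import Counter

def classify(alleles):
    """Return 'X/Y' with the more common allele first, or None if homozygous."""
    # Assumption: 'failed' calls do not count as an allele
    counts = Counter(a for a in alleles if a != 'failed')
    if len(counts) < 2:
        return None  # homozygous or all failed: the caller should drop this row
    (first, _), (second, _) = counts.most_common(2)
    return first + '/' + second

# Allele columns of the sample rows from the question
rows = {
    'bob': ['A', 'A', 'G', 'G', 'G', 'G', 'G'],
    'tom': ['A', 'A', 'A', 'A', 'A', 'A', 'A'],
    'sara': ['C', 'C', 'C', 'C', 'T', 'T', 'T'],
    'jane': ['failed'] * 7,
}

tags = {name: classify(alleles) for name, alleles in rows.items()}
```

Rows for which `classify` returns None are the ones to leave out of the output file.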

Answer 3

Another approach is to use a pandas DataFrame:

import pandas as pd

df = pd.read_table('myfile.txt', header=None, sep=" ", skipinitialspace=True)

select = ['A', 'G', 'C', 'T', 'failed']

# select out all the heterozygous rows
for elem in select:
    df = df[(df.iloc[:,3:10] != elem).any(axis=1)]

# reset the index since we removed rows
df = df.reset_index(drop=True)
df[10] = '' # column 10 will hold the tags

# add the tag to the line in the form A/B where count('A') > count('B') for a row
for i in range(df.shape[0]):
    tags = df.iloc[i, 3:10].unique().tolist()
    if sum(df.iloc[i, 3:10] == tags[0]) < sum(df.iloc[i, 3:10] == tags[1]):
        tags.reverse()
    df.iloc[i, 10] = '/'.join(tags)

df.to_csv('allele_counts.txt', sep=" ", header=False, index=False, na_rep='NULL')

When I run it with myfile.txt, I get the following allele_counts.txt:

bob NULL 0 A A G G G G G G/A
sara NULL 0 C C C C T T T C/T

Counting allele forms and removing homozygous lines
