Question

I'm trying to write a Python script to generate counts from the following DataFrame. I used COUNTIFS in Excel, but the duplicates in 'Sample' and 'Region' cause problems with COUNTIFS.

Example input df:

Sample  Chr Start   End Region  Size    Strand  Chr2    Start2  End2    Coverage    Overlap
101 chr1    198661465   198661475   NM_002838_PTPRC_intron_2_R  10  +   chr1    198608563   198661471   0   6
101 chr1    198661465   198661475   NM_001267798_PTPRC_intron_2_R   10  +   chr1    198608563   198661471   0   6
101 chr1    198661465   198661475   NM_080921_PTPRC_intron_2_R  10  +   chr1    198608563   198661471   0   6
101 chr1    236966727   236966942   NM_000254_MTR_cds_2 215 +   chr1    236966742   236966743   11  1
101 chr1    236966727   236966942   NM_001291939_MTR_cds_2  215 +   chr1    236966742   236966743   11  1
101 chr1    236966742   236966942   NM_001291940_MTR_5utr_2 200 +   chr1    236966742   236966743   11  1
101 chr1    236979843   236979853   NM_000254_MTR_intron_8_L    10  +   chr1    236979846   236979847   9   1
101 chr1    236979843   236979853   NM_000254_MTR_intron_8_L    10  +   chr1    236979847   236979848   8   1
101 chr1    236979843   236979853   NM_000254_MTR_intron_8_L    10  +   chr1    236979848   236979852   7   4
101 chr1    236979843   236979853   NM_000254_MTR_intron_8_L    10  +   chr1    236979852   236979854   6   1
101 chr1    236979843   236979853   NM_001291940_MTR_intron_8_L 10  +   chr1    236979846   236979847   9   1
101 chr1    236979843   236979853   NM_001291940_MTR_intron_8_L 10  +   chr1    236979847   236979848   8   1
101 chr1    236979843   236979853   NM_001291940_MTR_intron_8_L 10  +   chr1    236979848   236979852   7   4

So a single sample can have the same 'Region' listed more than once (with different coordinates, but that doesn't matter).

Desired output 1 - count by 'Sample' if 'Region' contains "utr" or "intron" or "cds", taking into account duplicate 'Region' entries per 'Sample':

Sample  Total   Intron  UTR CDS
101 68  40  13  15
102 64  38  13  13

Desired output 2 - sum of 'Overlap' by 'Sample' if 'Region' contains "utr" or "intron" or "cds":

Sample  Total   Intron  UTR CDS
101 2838    321 1433    1084
102 2524    291 1449    784

Desired output 3 - list of 'Region' values with a count of the number of samples that have that 'Region' listed:

Region  Num Samples
ENST00000390559_IGHM_cds_4  2
ENST00000390559_IGMH_cds_1  2
ENST00000390559_IGMH_cds_2  2
ENST00000390559_IGMH_cds_3  12
ENST00000390559_IGMH_intron_1_L 2
ENST00000390559_IGMH_intron_1_R 2
ENST00000390559_IGMH_intron_2_L 10

Edit: I've figured out how to get output #3:

df.groupby('Region').Sample.nunique()

I can get the Total column for output #1 with:

df.groupby('Sample').Region.nunique()
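
As a quick check of how these two calls treat repeated rows, here is a minimal sketch on toy data. The rows and region names are made up for illustration; only the column names 'Sample' and 'Region' come from the table above.

```python
import pandas as pd

# Hypothetical toy rows; only the two columns the groupby calls use.
df = pd.DataFrame({
    "Sample": [101, 101, 101, 102],
    "Region": ["geneA_intron_1", "geneA_intron_1", "geneA_cds_2", "geneA_cds_2"],
})

# Output 3: number of distinct samples in which each region appears.
samples_per_region = df.groupby("Region").Sample.nunique()

# Output 1 total: distinct regions per sample (a repeated region counts once).
regions_per_sample = df.groupby("Sample").Region.nunique()

# For contrast, count() tallies every row, repeats included.
rows_per_sample = df.groupby("Sample").Region.count()

print(samples_per_region)
print(regions_per_sample)
print(rows_per_sample)
```

Sample 101 lists geneA_intron_1 twice, so nunique() reports 2 regions for it while count() reports 3 rows, which is the duplicate problem that COUNTIFS ran into.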

Now I just need to figure out how to filter my groups to those containing 'utr', 'cds', or 'intron', and how to sum 'Overlap' for the filtered groups.

Answer 1

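In case anyone runs into a similar problem, this is what I came up with to generate the three outputs described. It may not be the most elegant solution, but it works!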

import pandas as pd
import argparse
import os
import sys

#arguments
parser = argparse.ArgumentParser(description="Generate counts by sample and total bases by sample of low coverage regions") 

parser.add_argument("-i", "--input", help="input filename", required=True)
parser.add_argument("-o", "--output", help="output basename", required=True) 

args = parser.parse_args() 

#output filenames
region_count_file = args.output + "_region_count.txt"
bases_count_file = args.output + "_bases_count.txt"
sample_count_file = args.output + "_sample_count.txt"

#read in
df = pd.read_table(args.input)

#check output doesn't exist
if os.path.exists(region_count_file) or os.path.exists(bases_count_file) or os.path.exists(sample_count_file):
    sys.exit("ERROR: output basename %s files already exist" % args.output)

#for filtering on different regions
intron = df['Region'].str.contains('intron')
utr = df['Region'].str.contains('utr')
cds = df['Region'].str.contains('cds')

#count regions per sample
unique_regions = df.groupby('Sample').Region.nunique()
unique_intron = df[intron].groupby('Sample').Region.nunique()
unique_utr = df[utr].groupby('Sample').Region.nunique()
unique_cds = df[cds].groupby('Sample').Region.nunique()

#sum bases per sample
bases_total = df.groupby(['Sample'])['Overlap'].sum()
bases_intron = df[intron].groupby(['Sample'])['Overlap'].sum()
bases_utr = df[utr].groupby(['Sample'])['Overlap'].sum()
bases_cds = df[cds].groupby(['Sample'])['Overlap'].sum()

#count samples per region
samples_per_region = df.groupby('Region').Sample.nunique()

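#format regions per sample for output
combine_region_count = pd.concat([unique_regions,unique_intron,unique_utr,unique_cds], axis=1)
combine_region_count.columns = ['Total','Intron','UTR','CDS']
#a sample with no rows of a given type gets NaN from concat, so fill with 0
combine_region_count = combine_region_count.fillna(0).astype(int)

#format bases per sample for output
combine_bases = pd.concat([bases_total,bases_intron,bases_utr,bases_cds], axis=1)
combine_bases.columns = ['Total','Intron','UTR','CDS']
#Overlap values are integer base counts, so casting back to int is safe here
combine_bases = combine_bases.fillna(0).astype(int)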

#format samples per region for output
#reset_index returns a new DataFrame instead of modifying in place, so assign the result
samples_per_region = samples_per_region.reset_index(name='Num Samples')


#output each
combine_region_count.to_csv(region_count_file,sep='\t')
combine_bases.to_csv(bases_count_file,sep='\t')
#Region is now an ordinary column, so leave the integer index out of the file
samples_per_region.to_csv(sample_count_file,sep='\t',index=False)
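
For comparison, a more compact variant of the same idea is sketched below. It is not the script above: it swaps the three boolean masks for a single label column built with str.extract, then groups on that label. The toy rows and the 'Type' column name are assumptions made for illustration. One behavioral difference to keep in mind: str.extract keeps only the first keyword matched in each region name, whereas the str.contains masks would count a row under every keyword it contains.

```python
import pandas as pd

# Toy data (hypothetical values, same column names as the question's table).
df = pd.DataFrame({
    "Sample": [101, 101, 101, 101, 102, 102],
    "Region": [
        "NM_000254_MTR_intron_8_L",
        "NM_000254_MTR_intron_8_L",   # same region listed twice for sample 101
        "NM_000254_MTR_cds_2",
        "NM_001291940_MTR_5utr_2",
        "NM_000254_MTR_cds_2",
        "NM_002838_PTPRC_intron_2_R",
    ],
    "Overlap": [1, 4, 1, 1, 2, 6],
})

# Label each row with the first of intron/utr/cds found in Region (NaN if none).
# "Type" is a helper column introduced for this sketch.
df["Type"] = df["Region"].str.extract(r"(intron|utr|cds)", expand=False)

# Unique regions per sample and type, spread into one column per type.
region_counts = (
    df.groupby(["Sample", "Type"])["Region"].nunique().unstack(fill_value=0)
)
region_counts.insert(0, "Total", df.groupby("Sample")["Region"].nunique())

# Summed Overlap per sample and type, laid out the same way.
base_sums = df.groupby(["Sample", "Type"])["Overlap"].sum().unstack(fill_value=0)
base_sums.insert(0, "Total", df.groupby("Sample")["Overlap"].sum())

print(region_counts)
print(base_sums)
```

The type columns come out lowercase (cds, intron, utr) because they are taken from the matched text; they can be renamed before writing the tables with to_csv.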

pandas count unique if condition is true
