Question

I have a sample inputfile.txt:

chr1    34870071    34899867    pi-Fam168b.1    -
chr11   98724946    98764609    pi-Wipf2.1  +
chr11   105898192   105920636   pi-Dcaf7.1  +
chr11   120486441   120495268   pi-Mafg.1   -
chr12   3891106 3914443 pi-Dnmt3a.1 +
chr12   82815946    82882157    pi-Map3k9.1 -
chr13   23855536    23856215    pi-Hist1h1a.1   +
chr13   55206682    55236190    pi-Zfp346.1 +
chr1    95700553    95718679    pi-Ing5.1   +
chr13   55313417    55419685    pi-Nsd1.1   +
chr14   27852218    27920472    pi-Il17rd.1 +
chr14   65430438    65568699    pi-Hmbox1.1 -
chr1    120524521   120581739   pi-Tfcp2l1.1    +
chr15   81633147    81657289    pi-Tef.1    +
chr15   89331804    89390691    pi-Shank3.1 +
chr15   103021983   103070259   pi-Cbx5.1   -
chr16   16896549    16927451    pi-Ppm1f.1  +
chr16   17233679    17263523    pi-Hic2.1   +
chr16   17452059    17486929    pi-Crkl.1   +
chr16   24393531    24992661    pi-Lpp.1    +
chr16   43964878    43979143    pi-Zdhhc23.1    -
chr17   25098236    25152532    pi-Cramp1l.1    -
chr17   27993451    28036985    pi-Uhrf1bp1.1   +
chr17   83973363    84031786    pi-Kcng3.1  -
chr1    133904194   133928161   pi-Elk4.1   +
chr18   60844148    60908308    pi-Ndst1.1  -
chr19   10057193    10059582    pi-Fth1.1   +
chr19   44637337    44650762    pi-Hif1an.1 +
chr1    135027714   135036359   pi-Ppp1r15b.1   +
chr2    28677821    28695861    pi-Gtf3c4.1 -
chr1    136651241   136852527   pi-Ppp1r12b.1   -
chr2    154262219   154365092   pi-Cbfa2t2.1    +
chr2    156022393   156135687   pi-Phf20.1  +
chr3    51028854    51055547    pi-Ccrn4l.1 +
chr3    94985683    95021902    pi-Gabpb2.1 -
chr1    158488203   158579750   pi-Abl2.1   +
chr4    45411294    45421633    pi-Mcart1.1 -
chr4    56879897    56960355    pi-D730040F13Rik.1  -
chr4    59818521    59917612    pi-Snx30.1  +
chr4    107847846   107890527   pi-Zyg11a.1 -
chr4    107900359   107973695   pi-Zyg11b.1 -
chr4    132195002   132280676   pi-Eya3.1   +
chr4    134968222   134989706   pi-Rcan3.1  -
chr4    136025678   136110697   pi-Luzp1.1  +
chr1    162933052   162964958   pi-Zbtb37.1 -
chr5    38591490    38611628    pi-Zbtb49.1 -
chr5    67783388    67819359    pi-Bend4.1  -
chr5    114387108   114443767   pi-Ssh1.1   -
chr5    115592990   115608225   pi-Mlec.1   -
chr5    143628624   143656891   pi-Fbxl18.1 -
chr1    172123561   172145541   pi-Uhmk1.1  -
chr6    83312367    83391602    pi-Tet3.1   -
chr6    85419571    85434653    pi-Fbxo41.1 -
chr6    116288039   116359551   pi-March08.1    +
chr6    120786229   120842859   pi-Bcl2l13.1    +
chr7    71031236    71083761    pi-Klf13.1  -
chr7    107068766   107128968   pi-Rnf169.1 -
chr7    139903770   140044311   pi-Fam53b.1 -
chr8    72285224    72298794    pi-Zfp866.1 -
chr8    106872110   106919708   pi-Cmtm4.1  -
chr8    112250549   112261649   pi-Atxn1l.1 -
chr10   41901651    41911816    pi-Foxo3.1  -
chr8    119682164   119739895   pi-Gan.1    +
chr8    125406988   125566154   pi-Ankrd11.1    -
chr9    27148219    27165314    pi-Igsf9b.1 +
chr9    44100521    44113717    pi-Hinfp.1  -
chr9    61761092    61762348    pi-Rplp1.1  -
chr9    106590412   106691503   pi-Rad54l2.1    -
chr9    114416339   114473487   pi-Trim71.1 -
chr9    119311403   119351032   pi-Acvr2b.1 +
chr9    119354082   119373348   pi-Exog.1   +
chr10   82822985    82831579    pi-D10Wsu102e.1 +
chr10   126415753   126437016   pi-Ctdsp2.1 +
chr1    90159688    90174093    pi-Hjurp.1  -
chr11   60591039    60597792    pi-Smcr8.1  +
chr11   69209318    69210176    pi-Lsmd1.1  +
chr11   75345218    75391069    pi-Slc43a2.1    +
chr11   79474214    79511524    pi-Rab11fip4.1  +
chr11   95818479    95868022    pi-Igf2bp1.1    -
chr11   97223641    97259855    pi-Socs7.1  +
chr11   97524530    97546757    pi-Mllt6.1  +
chr1    120355721   120355843   1-qE2.3-2.1 -
chr2    120518324   120540873   2-qE5-4.1   +
chr7    82913927    82926993    7-qD2-40.1  -

Column 1 = chromosome_number

Column 2 = start

Column 3 = end

Column 4 = gene_name

Column 5 = Orientation (+ or -)

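1.) I need to extract rows that have the same chromosome number (column 1), whose start sites (column 2) differ by at most 200, and that are in opposite orientations (one plus, the other minus).

This is what I have so far; I'm not sure where my mistake is: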

import itertools as it

def getrecords(path):
    # Yield each non-empty line as a list of whitespace-separated fields
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields

key = lambda x: x[0]  # group on chromosome (column 1)
for i, rec in it.groupby(sorted(getrecords('inputfile.txt'), key=key), key=key):
    for c0, c1 in it.combinations(rec, 2):
        # opposite orientation and start sites at most 200 apart
        if c0[4] != c1[4] and abs(int(c0[1]) - int(c1[1])) <= 200:
            print("%s\t%s\t%s" % (c0[0], c0[1], c0[3]))
            print("%s\t%s\t%s" % (c1[0], c1[1], c1[3]))

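Note: this code runs, but I need to take the end site (column 3) into account for rows with negative ('-') orientation. In other words, when comparing two rows, use the start site (column 2) for a row with '+' orientation and the end site (column 3) for a row with '-' orientation. How can I edit my code to meet all of these conditions?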

I expect to get about 15 unique sequence rows.

I will then sort those rows to eliminate duplicates.
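A minimal sketch of that de-duplication step, assuming each matching row is kept as a list of its five fields. `unique_rows` is a hypothetical helper name, not something from the code above, and the sort order (chromosome name as text, then numeric start) is only one possible choice.

```python
def unique_rows(rows):
    """Return the distinct rows, sorted by chromosome name and numeric start."""
    # Lists are not hashable, so convert each row to a tuple before using a set.
    distinct = {tuple(row) for row in rows}
    # Chromosome names sort as plain text here, so "chr10" comes before "chr2".
    return sorted(distinct, key=lambda row: (row[0], int(row[1])))
```

Every row printed by the pair loop could be appended to a list and passed through `unique_rows` once at the end.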

Answer 1

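Your checks look correct for "same chromosome number", "start site difference of 200 or less", and "opposite orientation".

I added a print statement for the start site difference and found that none of your difference values come anywhere near 200. Most of them are in the millions. Do you know which rows you expect to be printed for this sample file?

As for the orientation, I don't understand what you mean by the start and end having different orientations, since each row has only one orientation.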
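One possible reading of the question is that each row contributes a single comparison site: the start (column 2) when the orientation is '+', and the end (column 3) when it is '-'. The sketch below follows that reading, which is an assumption rather than something the asker confirmed. The names `site` and `find_pairs` and the `max_gap` parameter are made up for illustration.

```python
import itertools
from collections import defaultdict

def site(fields):
    """Pick the coordinate to compare: start for '+', end for '-'."""
    start, end, strand = int(fields[1]), int(fields[2]), fields[4]
    return start if strand == "+" else end

def find_pairs(lines, max_gap=200):
    """Return pairs of rows on one chromosome, on opposite strands,
    whose comparison sites are at most max_gap apart."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) >= 5:  # skip blank or incomplete lines
            by_chrom[fields[0]].append(fields)

    pairs = []
    for rows in by_chrom.values():
        for a, b in itertools.combinations(rows, 2):
            if a[4] != b[4] and abs(site(a) - site(b)) <= max_gap:
                pairs.append((a, b))
    return pairs
```

Since `find_pairs` accepts any iterable of lines, an open file handle for inputfile.txt can be passed to it directly. Whether this produces the roughly 15 rows the asker expects depends on the full data set.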

Answer 2

If your text file has a header line with the column names above all the rows, for example:

chromosome_number    start    end    gene_name    Orientation

and you happen to have the pandas package installed, you can extract the required values with this code:

import pandas
import itertools

# sep=r'\s+': split columns on any run of spaces or tabs
# (delim_whitespace=True did the same but is deprecated in recent pandas releases)
data = pandas.read_table('inputfile.txt', sep=r'\s+')
# group by chromosome_number
for name, group in data.groupby('chromosome_number'):
    # check differences of start site value between each other
    for a, b in itertools.combinations(group['start'], 2):
        # if difference <= 1000000
        if (abs(a - b) <= 1000000):
            # if orientations are opposite
            if (group.loc[group['start'] == a]['Orientation'].iloc[0] != group.loc[group['start'] == b]['Orientation'].iloc[0]):
                print(group.loc[group['start'] == a])
                print(group.loc[group['start'] == b])

In this case the difference threshold is set to 1000000. The output looks like:

   chromosome_number      start        end     gene_name Orientation
12              chr1  120524521  120581739  pi-Tfcp2l1.1           +
   chromosome_number      start        end    gene_name Orientation
81              chr1  120355721  120355843  1-qE2.3-2.1           -

Python: Iterate over a .txt file to extract data that meets your conditions
