Parse a df and create a new one with conditions

Time: 2018-05-18 08:58:20

Tags: python pandas parsing

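I need some help with Python and pandas.

I have a dataframe where the seq1_id column holds the seq_id of the sequences from species 1 (sp1) and the second column holds those from species 2 (sp2).

I ran a filter on these sequences and got two more dataframes: one with all the sp1 sequences that pass the filter, and one with all the sp2 sequences that pass the filter.

So I have 3 dataframes.

Within a pair, one sequence can pass the filter while the other does not, and it is important to keep only the pairs in which both genes passed the filter. So what I need to do is parse my first df, which looks like this: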

Seq_1.id    Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

and check row by row whether (row 1, for example) seq1_A is present in df2 and seq8_B is also present in df3. If so, keep this row of df1 and add it to a new df4.

Here is an example with the output I need:

first df:
Seq_1.id    Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A

df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B

Then, because seq3_A and Seq11_B are absent, df4 (the output) would be:

Seq_1.id   Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B

Thank you :)
I tried the following:

candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 =dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene'])&dN_dS['seq2_id'].isin(candidates_0035['gene'])]

I get an empty output with only the column names, which should not happen. If you want to test the code, here is the data:

DF1 (dN_dS):

    Unnamed: 0  seq1_id seq2_id dN  dS  Dist_third_pos  Dist_brute  Length_seq_1    Length_seq_2    GC_content_seq1 GC_content_seq2 GC  Mean_length
0   0   g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104  0.23600000000000002 0.142   535.0   1024.0  49.1588785046729    51.171875   50.165376752336456  535.0
1   1   g45594.t1_0035_0035 g1464.t1_0042_0042  0.5208761055250978  5.430485421797574   0.7120000000000001  0.489   246.0   222.0   47.967479674796756  44.594594594594604  46.28103713469567   222.0
2   2   g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867  0.262   0.139   895.0   749.0   56.312849162011176  57.67690253671562   56.994875849363396  749.0
3   3   g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516  26.834927363887587  0.5760000000000001  0.433   597.0   633.0   37.85594639865997   39.810426540284354  38.83318646947217   597.0
4   4   g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165    0.1181241496387666  0.1 0.069   450.0   461.0   45.111111111111114  44.90238611713666   45.006748614123886  450.0
5   5   g1005.t1_0035_0035  g11524.t1_0042_0042 0.3528036631463747  19.32549458735676   0.71    0.512   3177.0  3804.0  39.06200818382121   52.944269190325976  46.0031386870736    3177.0
6   6   g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786  26.823981621115166  0.6859999999999999  0.469   516.0   591.0   49.224806201550386  53.46869712351946   51.346751662534935  516.0
7   7   g6202.t1_0035_0035  g193.t1_0042_0042   0.4679458383555545  17.81312422445775   0.66    0.462   804.0   837.0   41.91542288557214   47.67025089605735   44.79283689081474   804.0
8   8   g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165    0.13320612138898    0.122   0.067   348.0   408.0   56.89655172413793   55.392156862745104  56.1443542934415    348.0
9   9   g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0   320.0   59.012345679012356  58.4375 58.72492283950618   320.0
10  10  g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965  0.15106487998618026 0.135   0.17600000000000002 270.0   276.0   51.111111111111114  51.44927536231884   51.28019323671497   270.0
11  11  g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916  21.13839474189674   0.6940000000000001  0.401   297.0   357.0   54.882154882154886  50.42016806722689   52.65116147469089   297.0
12  12  EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758  19.278839157918302  0.7070000000000001  0.488   1321.0  1719.0  38.53141559424678   43.92088423502036   41.22614991463357   1321.0
13  13  g13075.t1_0035_0042 g504.t1_0042_0035   0.3317504066721263  4.790120127840871   0.65    0.38799999999999996 372.0   408.0   59.40860215053763   51.470588235294116  55.43959519291587   372.0
14  14  g1026.t1_0035_0035  g7716.t1_0042_0042  0.21445770772761286 13.92799368027682   0.626   0.344   336.0   315.0   38.095238095238095  44.444444444444436  41.26984126984127   315.0
15  15  g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637  20.33679494952895   0.6759999999999999  0.44799999999999995 320.0   366.0   50.9375 49.453551912568315  50.19552595628416   320.0

DF3:

    Unnamed: 0 gene scaf_name   start   end cov_depth   GC
179806  g13600.t1_0042_0042 scaffold_6556   1   1149    2.42361684558216    0.528846153846154
315037  g34744.t1_0042_0035 scaffold_8076   17  765 3.49803921568627    0.386138613861386
317296  g35222.t1_0042_0035 scaffold_9018   1   614 93.071661237785 0.41
183513  g14327.t1_0042_0042 scaffold_9358   122 529 3.3184165232357996  0.36
328164  g37790.t1_0042_0035 scaffold_16356  1   320 2.73125 0.436241610738255
326617  g37405.t1_0042_0035 scaffold_14890  1   341 1.3061224489795902  0.36898395721925104
188515  g15510.t1_0042_0042 scaffold_20183  1   276 137.326086956522    0.669354838709677
184561  g14562.t1_0042_0042 scaffold_10427  1   494 157.993927125506    0.46145940390544704
290684  g30982.t1_0042_0035 scaffold_3800   440 940 174.499839537869    0.39823008849557506
179993  g13632.t1_0042_0042 scaffold_6654   29  1114    3.56506849315068    0.46153846153846206
181670  g13942.t1_0042_0042 scaffold_7830   1   811 5.307028360049321   0.529411764705882
196148  g20290.t1_0042_0035 scaffold_1145   2707    9712    78.84112231766741   0.367283950617284
313624  g34464.t1_0042_0035 scaffold_7610   1   480 7.740440324449589   0.549019607843137
303133  g32700.t1_0042_0035 scaffold_5119   1735    2373    118.436578171091    0.49074074074074103

1 answer:

Answer 0 (score: 1)

This should do it:

df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
#  Seq_1.id Seq_2.id
#0   seq1_A   seq8_B
#1   seq2_A   Seq9_B

Edit

You have to swap the two candidate tables; this version does not return an empty result:

df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]
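
The attempt in the question came back empty because seq1_id holds the _0035 gene IDs but was compared against candidates_0042 (and seq2_id against candidates_0035), so no row could satisfy both conditions. One way to spot this kind of mix-up is to count the matches for every column/table combination before filtering. The sketch below uses a few IDs copied from the posted data. The contents of candidates_0035 were not posted, so the two genes placed in it are an assumption made purely for illustration.

```python
import pandas as pd

# A few pairs copied from the posted dN_dS table.
dN_dS = pd.DataFrame({
    "seq1_id": ["g66097.t1_0035_0035", "g45594.t1_0035_0035",
                "g50055.t1_0035_0035", "g28436.t1_0035_0042"],
    "seq2_id": ["g13600.t1_0042_0042", "g1464.t1_0042_0042",
                "g34744.t1_0042_0035", "g35222.t1_0042_0035"],
})
# Genes copied from the posted candidates table (IDs carry the 0042 prefix).
candidates_0042 = pd.DataFrame({"gene": ["g13600.t1_0042_0042", "g34744.t1_0042_0035",
                                         "g35222.t1_0042_0035", "g14327.t1_0042_0042"]})
# ASSUMPTION: hypothetical contents, this table was not shown in the question.
candidates_0035 = pd.DataFrame({"gene": ["g66097.t1_0035_0035", "g50055.t1_0035_0035"]})

# Count how many IDs of each column occur in each candidate table.
overlap = {
    (col, name): int(dN_dS[col].isin(cand["gene"]).sum())
    for col in ("seq1_id", "seq2_id")
    for name, cand in (("0035", candidates_0035), ("0042", candidates_0042))
}
print(overlap)

# Swapped pairing (as in the question) vs. matching pairing (as in the edit).
swapped = dN_dS[dN_dS["seq1_id"].isin(candidates_0042["gene"])
                & dN_dS["seq2_id"].isin(candidates_0035["gene"])]
fixed = dN_dS[dN_dS["seq1_id"].isin(candidates_0035["gene"])
              & dN_dS["seq2_id"].isin(candidates_0042["gene"])]
print(len(swapped), len(fixed))
```

A count of zero for a column against the table it was paired with is the sign that the two tables need to be exchanged.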