Order according to FASTA SeqId

Date: 2018-05-16 09:06:52

Tags: python pandas merge

I have two FASTA files, candidate_aa_0042.fasta and candidates_aa_0035.fasta,

and two dataframes, Best_blast_candidate_hit_0042.csv and Best_blast_candidate_hit_0035.csv.

Here are examples of them:

qseqid  sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    salltitles  staxids scientific_name scomnames   sskingdoms  Order
g44459.t1_0035_0035 XP_011687429.1  39.5    157 95  0   7   163 2   158 8.1e-27 129.8   uncharacterized protein LOC105449744 [Wasmannia auropunctata]   64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g17612.t1_0035_0042 XP_011699787.1  59.3    349 142 0   99  447 336 684 1.5e-120    442.6   uncharacterized protein LOC105457055 [Wasmannia auropunctata]   64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g29924.t1_0035_0042 XP_011871948.1  67.0    261 85  1   1   260 18  278 1.3e-100    375.6   uncharacterized protein LOC105564266, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g47960.t1_0035_0035 XP_011860868.1  68.8    298 93  0   1   298 142 439 3.3e-116    427.6   uncharacterized protein LOC105558006 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g28580.t1_0035_0042 XP_011883624.1  70.0    240 69  3   1   239 41  278 1.3e-86 328.9   uncharacterized protein LOC105570787 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera

qseqid  sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    salltitles  staxids scientific_name scomnames   sskingdoms  Order
g34354.t1_0042_0035 XP_011699801.1  43.7    135 63  4   7   128 625 759 9.3e-17 96.3    LOW QUALITY PROTEIN 64793   Wasmannia auropunctata      Eukaryota   Hymenoptera
g34606.t1_0042_0035 XP_011871948.1  59.8    249 79  2   1   228 51  299 3.4e-81 310.8   uncharacterized protein LOC105564266, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g13215.t1_0042_0042 XP_011883625.1  62.0    242 92  0   46  287 160 401 5.4e-82 313.9   uncharacterized protein LOC105570788, partial [Vollenhovia emeryi]  411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g35379.t1_0042_0035 XP_011858260.1  73.3    191 51  0   4   194 690 880 6.3e-76 293.1   uncharacterized protein LOC105555830 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera
g13770.t1_0042_0042 XP_011883624.1  66.5    203 65  3   10  211 33  233 1.9e-65 258.5   uncharacterized protein LOC105570787 [Vollenhovia emeryi]   411798  Vollenhovia emeryi      Eukaryota   Hymenoptera

I need to merge them in the same order as the SeqIds appear in the FASTA files.
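Getting the SeqId order out of a FASTA file could look something like the sketch below. It assumes the ID is the first whitespace-delimited token on each `>` header line; `fasta_ids` is a hypothetical helper name, and the in-memory text only stands in for a real file (with a real file one would pass `open(path)` instead).

```python
import io

def fasta_ids(handle):
    """Return the sequence IDs of a FASTA stream, in file order."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids

# In-memory stand-in for a small FASTA file.
example = io.StringIO(
    ">seq1_0035_0042\nATGGAGAGATAG\n>seq6_0035_0035\nATGGATAGAGA\n"
)
ids_1 = fasta_ids(example)
print(ids_1)
```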

For example, if FASTA file 1 contains:

>seq1_0035_0042
ATGGAGAGATAG
>seq6_0035_0035
ATGGATAGAGA

and FASTA file 2 contains:

>seq8_0042_0042
ATGGAGAGATAG
>seq3_0042_0035
ATGGATAGAGA

then I would like to merge my dataframes in that order.

For example:

qseqid_1       sseqid_1       qseqid_2       sseqid_2       pident_1 pident_2 etc...
seq1_0035_0042 XP_011883678.1 seq8_0042_0042 XP_011883789.1   78.9   45.9 etc
seq6_0035_0035 XP_011566754.1 seq3_0042_0035 XP_011566754.1   67.9   78.0 etc
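One possible way to get this side-by-side layout, sketched under a few assumptions: each `qseqid` occurs at most once per table (best hit only), the ID lists have already been read from the FASTA files, and rows are paired purely by position (row i of file 1 with row i of file 2). The toy frames and the helper `in_fasta_order` are hypothetical stand-ins, not the real CSVs.

```python
import pandas as pd

# Hypothetical ID orders, as they would be read from the two FASTA files.
ids_1 = ["seq1_0035_0042", "seq6_0035_0035"]
ids_2 = ["seq8_0042_0042", "seq3_0042_0035"]

# Toy stand-ins for the two BLAST tables, deliberately not in FASTA order.
df_1 = pd.DataFrame({
    "qseqid": ["seq6_0035_0035", "seq1_0035_0042"],
    "sseqid": ["XP_011566754.1", "XP_011883678.1"],
    "pident": [67.9, 78.9],
})
df_2 = pd.DataFrame({
    "qseqid": ["seq3_0042_0035", "seq8_0042_0042"],
    "sseqid": ["XP_011566754.1", "XP_011883789.1"],
    "pident": [78.0, 45.9],
})

def in_fasta_order(df, ids, suffix):
    """Reorder the rows of df to follow ids and suffix every column name."""
    ordered = df.set_index("qseqid").reindex(ids)
    ordered.index.name = "qseqid"  # keep the column name after reset_index
    return ordered.reset_index().add_suffix(suffix)

# Both pieces now carry a fresh 0..n-1 index, so concat pairs rows by position.
merged = pd.concat(
    [in_fasta_order(df_1, ids_1, "_1"), in_fasta_order(df_2, ids_2, "_2")],
    axis=1,
)
print(merged)
```

The columns come out grouped per file (`qseqid_1`, `sseqid_1`, `pident_1`, then the `_2` columns); they can be reordered afterwards by selecting them in the desired order.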

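PS: Not all of the SeqIds in the FASTA files are present in the dataframes, so if a sequence has no pair, can we still add it to the dataframe and put a NaN in the second column? Thanks for your help :))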
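For the missing-ID case, a left merge driven by the FASTA ID list is one option: it keeps every ID in FASTA order and leaves NaN where the hit table has no row. This is only a sketch; `seq9_0035_0042` is a made-up ID with no hit, and the small `hits` frame stands in for one of the real CSVs.

```python
import pandas as pd

# Hypothetical FASTA ID order; "seq9_0035_0042" is invented and has no hit.
fasta_ids = ["seq1_0035_0042", "seq9_0035_0042", "seq6_0035_0035"]

# Toy stand-in for one BLAST best-hit table.
hits = pd.DataFrame({
    "qseqid": ["seq6_0035_0035", "seq1_0035_0042"],
    "sseqid": ["XP_011566754.1", "XP_011883678.1"],
    "pident": [67.9, 78.9],
})

# A one-column frame of FASTA IDs drives the row order. The left merge keeps
# every ID and fills NaN where the hit table has no matching row.
order = pd.DataFrame({"qseqid": fasta_ids})
with_gaps = order.merge(hits, on="qseqid", how="left", indicator=True)
print(with_gaps)
```

The `_merge` column added by `indicator=True` marks rows without a hit as `left_only`, which makes them easy to filter or count; it can be dropped once it is no longer needed.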

0 answers:

No answers yet