Creating a pandas DataFrame by operating on two dictionaries

Time: 2018-04-10 11:38:30

Tags: python pandas dictionary joblib

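I am trying to create a pandas DataFrame whose rows come from one dictionary and whose columns come from another, where the value at row[i], column[j] is defined by some operation on the key-value pairs of the two dictionaries (that is, the value at row[dict1[key]], column[dict2[key]] can be computed by a function that takes dict1[key] and dict2[key] as arguments).

My code so far looks like this: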

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
from optparse import OptionParser
from sklearn.preprocessing import MinMaxScaler
from joblib import Parallel, delayed
import pybedtools
from subprocess import call
from collections import defaultdict
import numpy as np
from skbio.sequence import DNA
from skbio.alignment import local_pairwise_align_ssw

class sequenceCompare:

    '''Common class for comparing multifasta files'''

    def __init__(
        self,
        fasta1,
        fasta2
        ):
        self.fasta1 = fasta1
        self.fasta2 = fasta2

    def computeScore(self):
        sequenceList1 = {}
        sequenceList2 = {}
        score_matrix = pd.DataFrame([])
        with open(self.fasta1) as file_one:
            sequenceList1 = {line.strip(">\n"):next(file_one).rstrip() for line in file_one}        
        with open(self.fasta2) as file_two:
            sequenceList2 = {line.strip(">\n"):next(file_two).rstrip() for line in file_two} 
        # Is there any way to make the following step parallel?
        for key1, value1 in sequenceList1.items():
            for key2, value2 in sequenceList2.items():
                alignment, score, start_end_positions = local_pairwise_align_ssw(DNA(value1), DNA(value2))
                # Store the score in the DataFrame at column key1, row key2

For example:

Sequence list 1: 
>A1
AAACCTTGGG
>A2
CCCAAAATTT
>A3
CCTTAAGGG

Sequence list 2:
>B1
GGTTAACC
>B2
GATCATCCA
>B3
CCAAAATTC

The DataFrame obtained after operating on the two dictionaries should look like this:

Dataframe:
    A1           A2           A3
B1  dist(A1,B1)  dist(A2,B1)  dist(A3,B1)
B2  dist(A1,B2)  dist(A2,B2)  dist(A3,B2)
B3  dist(A1,B3)  dist(A2,B3)  dist(A3,B3)

What is the most efficient (and hopefully parallel) way to do this?

2 answers:

Answer 0: (score: 0)

This code should do the trick:

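import pandas as pd

dict1 = {'B2': 'GATCATCCA', 'B3': 'CCAAAATTC', 'B1': 'GGTTAACC'}
dict2 = {'A2': 'CCCAAAATTT', 'A3': 'CCTTAAGGG', 'A1': 'AAACCTTGGG'}

def func(seq_a, seq_b):
    # Placeholder (an assumption, not part of the original answer):
    # replace with the real scoring function, e.g. an alignment score.
    return abs(len(seq_a) - len(seq_b))

finaldict = {}
for key1, value1 in dict1.items():
    for key2, value2 in dict2.items():
        # Outer keys (dict2) become columns, inner keys (dict1) become rows.
        finaldict.setdefault(key2, {})[key1] = func(value2, value1)
df = pd.DataFrame(finaldict)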
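
Since the question also asks about parallelism, the same table can be built from a flat list of pairs that is handed to an executor. The following is only a sketch under stated assumptions: `score` is a hypothetical stand-in (it counts shared 3-mers) for the real alignment call, and `pairwise_frame` is a name made up for this example. It uses a thread pool to stay self-contained; for a CPU-bound pure-Python scoring function, `ProcessPoolExecutor` offers the same `map` interface and is the variant that would spread the work over several cores.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

import pandas as pd


def score(seq_a, seq_b, k=3):
    """Hypothetical stand-in for the real scoring function: number of shared k-mers."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return len(kmers_a & kmers_b)


def pairwise_frame(cols, rows, func, max_workers=4):
    """Return a DataFrame with one column per key of `cols` and one row per key of `rows`."""
    # Row-major list of (row_key, col_key) pairs, so the flat result reshapes cleanly.
    pairs = list(product(rows, cols))
    col_args = [cols[c] for _, c in pairs]
    row_args = [rows[r] for r, _ in pairs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `pairs`.
        flat = list(pool.map(func, col_args, row_args))
    n = len(cols)
    table = [flat[i:i + n] for i in range(0, len(flat), n)]
    return pd.DataFrame(table, index=list(rows), columns=list(cols))


seqs_a = {'A1': 'AAACCTTGGG', 'A2': 'CCCAAAATTT', 'A3': 'CCTTAAGGG'}
seqs_b = {'B1': 'GGTTAACC', 'B2': 'GATCATCCA', 'B3': 'CCAAAATTC'}
df = pairwise_frame(seqs_a, seqs_b, score)
print(df)
```

Flattening the pairs first keeps the scheduling independent of the table layout: the executor only sees a list of independent calls, and the reshape back into rows happens once at the end.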

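Answer 1: (score: 0)

Judging from the documentation, it appears to be more efficient to build a StripedSmithWaterman object and reuse it many times, rather than calling local_pairwise_align_ssw every time. However, it does not seem to offer parallelism (which is odd, because the library on which it is based claims to implement SIMD parallelism, so I may be wrong), but you can parallelize it with regular Python multiprocessing: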

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
from optparse import OptionParser
from sklearn.preprocessing import MinMaxScaler
from joblib import Parallel, delayed
import pybedtools
from subprocess import call
from multiprocessing import Pool
from itertools import repeat
from collections import defaultdict
import numpy as np
from skbio.sequence import DNA
from skbio.alignment import StripedSmithWaterman


def compute_scores(dna1, dnas2):
    # StripedSmithWaterman docs:
    # http://scikit-bio.org/docs/0.4.2/generated/skbio.alignment.StripedSmithWaterman.html
    ssw1 = StripedSmithWaterman(dna1)
    # AlignmentStructure docs:
    # http://scikit-bio.org/docs/0.4.2/generated/skbio.alignment.AlignmentStructure.html
    return [ssw1(dna2).optimal_alignment_score for dna2 in dnas2]

class sequenceCompare:

    '''Common class for comparing multifasta files'''

    def __init__(
        self,
        fasta1,
        fasta2
        ):
        self.fasta1 = fasta1
        self.fasta2 = fasta2

    def computeScore(self):
        sequenceList1 = {}
        sequenceList2 = {}
        score_matrix = pd.DataFrame([])
        with open(self.fasta1) as file_one:
            sequenceList1 = {line.strip(">\n"):next(file_one).rstrip() for line in file_one}
        with open(self.fasta2) as file_two:
            sequenceList2 = {line.strip(">\n"):next(file_two).rstrip() for line in file_two}
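        with Pool(os.cpu_count()) as p:
            values2 = list(sequenceList2.values())
            # One task per sequence in file one; each task scores it against all of file two
            data = p.starmap(compute_scores, zip(sequenceList1.values(), repeat(values2)))
        # data holds one row per sequence in file one, so label it that way and
        # transpose to get file-one names as columns and file-two names as rows
        df = pd.DataFrame(data, index=list(sequenceList1.keys()), columns=list(sequenceList2.keys())).T
        return df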
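
The dictionary comprehensions used to read the files assume that every record occupies exactly two lines (a header and a single sequence line). For files where sequences wrap over several lines, a small reader along the following lines could be swapped in. This is a sketch, not part of the original answer: `read_fasta` is a hypothetical helper, and it assumes plain FASTA with `>` headers and no comment lines.

```python
import io


def read_fasta(handle):
    """Parse FASTA from an open text handle into {name: sequence}.

    Hypothetical helper: joins wrapped sequence lines and skips blank lines.
    """
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Header line: start a new record keyed by the text after '>'.
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence line: append to the record currently being read.
            records[name].append(line)
    return {key: ''.join(parts) for key, parts in records.items()}


example = io.StringIO(">A1\nAAACC\nTTGGG\n>A2\nCCCAAAATTT\n")
sequences = read_fasta(example)
print(sequences)
```

Inside `computeScore`, a call such as `sequenceList1 = read_fasta(file_one)` would take the place of the corresponding comprehension.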