Using Python 3

Time: 2019-07-12 08:48:33

Tags: python-3.x pandas

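I need to compute the distances between all possible pairs of contact points in a DataFrame using Python 3. My code runs, but it is very slow. How can I reduce the run time?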

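Condition: distances are computed only between genes whose chr string is the same. chr1 appears at row indices 0, 3 and 5, so it forms three combinations:

1) MK1 MI5 45 (62 ~ 17)

2) MK1 MR4 11 (62 ~ 51)

3) MI5 MR4 34 (17 ~ 51)

Likewise, chr2 appears at indices 1 and 2, so its only combination is LC1 LI6 18 (16 ~ 34).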
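The grouping rule described above can be sketched with `groupby` and `itertools.combinations`. This is only an illustration under a few assumptions: the sample values are hard-coded from the INPUT table, the `st1` column is left out because the distance uses only `st`, and the names `rows` and `pairs` are made up for this example. It lists each unordered pair once.

```python
from itertools import combinations

import pandas as pd

# Sample values copied from the INPUT table (st1 omitted, it duplicates st).
data = pd.DataFrame({
    "chr": ["chr1", "chr2", "chr2", "chr1", "chr3", "chr1"],
    "st": [62, 16, 34, 17, 15, 51],
    "gene": ["MK1", "LC1", "LI6", "MI5", "LI6", "MR4"],
})

rows = []  # hypothetical name: one (gene1, gene2, dist) tuple per pair
# sort=False keeps the groups in order of first appearance.
for _, group in data.groupby("chr", sort=False):
    # Every unordered pair of rows that share the same chr value.
    for (g1, s1), (g2, s2) in combinations(zip(group["gene"], group["st"]), 2):
        rows.append((g1, g2, int(abs(s1 - s2))))

pairs = pd.DataFrame(rows, columns=["gene1", "gene2", "dist"])
print(pairs)
```

A chromosome with a single row, such as chr3, contributes no pairs.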

 INPUT
       chr   st  st1 gene
 0     chr1  62  62  MK1
 1     chr2  16  16  LC1
 2     chr2  34  34  LI6
 3     chr1  17  17  MI5
 4     chr3  15  15  LI6
 5     chr1  51  51  MR4

  OUTPUT
  gene1 gene2  dist
  MK1   MI5    45 
  MK1   MR4    11 
  MI5   MK1    45

 ```
 import pandas as pd

 data1 = {'chr': ['chr1','chr2','chr2','chr1','chr3','chr1'],
    'st': [62,16,34,17,15,51],
    'st1': [62,16,34,17,15,51],
    'gene':['MK1','LC1','LI6','MI5','LI6','MR4']}
 data = pd.DataFrame(data1)
 chroms = pd.Series(data.chr.unique())  # renamed from chr to avoid shadowing the built-in

 cols = ['gene1','gene2','dist']
 rows = []  # collected in a list; DataFrame.append was removed in pandas 2.0

 for chr_num in chroms:
      for i in range(0,len(data)):
         for j in range(0,len(data)):
             # every ordered pair of distinct rows on the current chromosome
             if i != j and chr_num == data.iloc[i,0] and chr_num == data.iloc[j,0]:
                 dist = abs(data.iloc[i, 1] - data.iloc[j, 1])
                 rows.append([data.iloc[i,3],data.iloc[j,3],dist])

 all_dist_comb = pd.DataFrame(rows, columns=cols)
 all_dist_comb['dist'] = all_dist_comb['dist'].astype(int)

 print(all_dist_comb)
 ```
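One possible way to avoid the Python-level loops is to join the table to itself on `chr`. The following is a sketch, not a benchmarked solution. It assumes the distance depends only on `st`, hard-codes the sample values from the INPUT table, and uses made-up names (`left`, `merged`, `result`, the `row` column and the `_a`/`_b` suffixes). Like the loop, it keeps both orderings of each pair, although the row order of the result may differ.

```python
import pandas as pd

# Sample values copied from the INPUT table (st1 omitted, it duplicates st).
data = pd.DataFrame({
    "chr": ["chr1", "chr2", "chr2", "chr1", "chr3", "chr1"],
    "st": [62, 16, 34, 17, 15, 51],
    "gene": ["MK1", "LC1", "LI6", "MI5", "LI6", "MR4"],
})

# Keep the original row position so a row can be told apart from itself.
left = data.reset_index().rename(columns={"index": "row"})

# Joining on chr pairs every row with every row of the same chromosome.
merged = left.merge(left, on="chr", suffixes=("_a", "_b"))

# Drop self-pairs (i == j); both orderings of each pair remain.
merged = merged[merged["row_a"] != merged["row_b"]]

result = pd.DataFrame({
    "gene1": merged["gene_a"],
    "gene2": merged["gene_b"],
    "dist": (merged["st_a"] - merged["st_b"]).abs(),
}).reset_index(drop=True)
print(result)
```

The intermediate join holds one row per ordered pair within each chromosome, so memory grows with the square of the largest group size.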

0 answers:

No answers