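I need to compute, in Python 3, the distance between every possible pair of contact points in a DataFrame. My code works, but it is very slow. How can I reduce the running time?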
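Condition: distances are computed only between genes whose chr string is the same. chr1 appears at row indices 0, 3 and 5, so it forms three combinations:
1) MK1 MI5 45 (62~17)
2) MK1 MR4 11 (62~51)
3) MI5 MR4 34 (17~51)
Likewise, chr2 appears at indices 1 and 2, so its only combination is LC1 LI6 18 (16~34).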
INPUT
chr st st1 gene
0 chr1 62 62 MK1
1 chr2 16 16 LC1
2 chr2 34 34 LI6
3 chr1 17 17 MI5
4 chr3 15 15 LI6
5 chr1 51 51 MR4
OUTPUT
gene1 gene2 dist
MK1 MI5 45
MK1 MR4 11
MI5 MK1 45
```
import pandas as pd

data1 = {'chr': ['chr1', 'chr2', 'chr2', 'chr1', 'chr3', 'chr1'],
         'st': [62, 16, 34, 17, 15, 51],
         'st1': [62, 16, 34, 17, 15, 51],
         'gene': ['MK1', 'LC1', 'LI6', 'MI5', 'LI6', 'MR4']}
data = pd.DataFrame(data1)
chroms = pd.Series(data.chr.unique())  # renamed from `chr`, which shadows the builtin
cols = ['gene1', 'gene2', 'dist']
rows = []
# For every chromosome, compare every ordered pair of rows (i, j).
for chr_num in chroms:
    for i in range(0, len(data)):
        for j in range(0, len(data)):
            if i != j and chr_num == data.iloc[i, 0] and chr_num == data.iloc[j, 0]:
                dist = abs(data.iloc[i, 1] - data.iloc[j, 1])
                # DataFrame.append was removed in pandas 2.0, so collect rows in a list.
                rows.append([data.iloc[i, 3], data.iloc[j, 3], dist])
all_dist_comb = pd.DataFrame(rows, columns=cols)
all_dist_comb['dist'] = all_dist_comb['dist'].astype(int)
print(all_dist_comb)
```
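One possible way to cut the running time is to group the rows by chr once and then enumerate pairs only inside each group, instead of scanning the whole frame for every chromosome with `iloc`. The sketch below is not the code from the question: it assumes the rows are available as plain `(chr, position, gene)` tuples, and the helper name `pairwise_distances` is made up for illustration. It lists each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_distances(rows):
    """Hypothetical helper: rows is an iterable of (chr, position, gene).

    Returns a list of (gene1, gene2, dist) tuples, one per unordered
    pair of rows that share the same chr value.
    """
    # Bucket the rows by chromosome in a single pass.
    groups = defaultdict(list)
    for chrom, pos, gene in rows:
        groups[chrom].append((gene, pos))

    result = []
    for members in groups.values():
        # combinations() yields each pair exactly once, in input order.
        for (g1, p1), (g2, p2) in combinations(members, 2):
            result.append((g1, g2, abs(p1 - p2)))
    return result


# Same values as the INPUT table in the question.
rows = [
    ("chr1", 62, "MK1"),
    ("chr2", 16, "LC1"),
    ("chr2", 34, "LI6"),
    ("chr1", 17, "MI5"),
    ("chr3", 15, "LI6"),
    ("chr1", 51, "MR4"),
]
pairs = pairwise_distances(rows)
for gene1, gene2, dist in pairs:
    print(gene1, gene2, dist)
```

With the DataFrame from the question, the input could be built with `zip(data['chr'], data['st'], data['gene'])` and the result turned back into a frame with `pd.DataFrame(pairs, columns=['gene1', 'gene2', 'dist'])`. If both orderings of each pair are wanted (the OUTPUT above lists MI5 MK1 as well as MK1 MI5), `itertools.permutations(members, 2)` can replace `combinations`.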