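I have many pairs of values (pairs of bonded atoms) read from a file that contains different molecules. If two pairs share a member, they belong to the same molecule. I am trying to find an efficient way in Python to group the atoms by the molecule they belong to.
For example, ethane and methane (atoms 1, 5 and 9 are carbons, the rest are hydrogens) would be: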
[[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
I would like to obtain a list/array:
[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]
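I have tried several approaches, but they are really inefficient for files with a large number of atoms. There should be a smart way to do it, but I cannot find it. Any ideas?
Thanks, Joan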
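Answer 0 (score: 1)
If I understand correctly, what you are trying to do is identify the connected components of a graph in which each node is an atom and each edge is a bond (so one connected component is one molecule). There is an efficient implementation in scipy.sparse.csgraph.
First, let's set up the graph as a sparse matrix: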
import numpy as np
import scipy.sparse as sps
# Input as provided
edges = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
# Modify the input by adding, for each [x,y], also [y,x].
# Also transform it to a set and then again to a list
# to assure that we don't duplicate anything.
edges = list({(x[0],x[1]) for x in edges}.union({(x[1],x[0]) for x in edges}))
# Create it as a matrix. The weights of all edges are set to 1,
# as they don't matter anyway.
graph = sps.csr_matrix(([1]*len(edges), np.array(edges).T))
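At this point it is enough to call scipy.sparse.csgraph.connected_components, although by default the output comes in a slightly different format:
(3, array([0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]))
The first element is the number of components and the second is the component label of each node index. So let's reshape it a bit: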
from scipy.sparse import csgraph
connected_components = csgraph.connected_components(graph)
result = []
for u in range(1, connected_components[0]):
    result.append(np.where(connected_components[1]==u)[0])
result
[array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64),
 array([ 9, 10, 11, 12, 13], dtype=int64)]
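Note also that the range starts from 1: Python counts from 0 while the atoms are numbered from 1, so node 0 ends up as an isolated node in its own component (label 0), which the loop skips. If the atom numbers are not consecutive, the unused numbers also show up as isolated nodes and need to be dropped, for example: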
result = [r for r in result if len(r) > 1]
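If SciPy is not available, or the atom labels are not small consecutive integers, the same connected-components idea can be sketched in plain Python with a union-find (disjoint-set) structure. This is not the SciPy routine used above, and `group_atoms` is a hypothetical helper name chosen for illustration:

```python
from collections import defaultdict

def group_atoms(bonds):
    """Group atom labels into molecules (connected components) via union-find."""
    parent = {}  # maps each atom to its parent in the disjoint-set forest

    def find(x):
        # Register unseen atoms as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # a bond joins the two groups

    molecules = defaultdict(list)
    for atom in parent:
        molecules[find(atom)].append(atom)
    return sorted(sorted(m) for m in molecules.values())

edges = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
print(group_atoms(edges))
# [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13]]
```

Because atoms are stored as dictionary keys, only labels that actually appear in a bond are considered, so no isolated placeholder nodes have to be filtered out afterwards.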
Answer 1 (score: 0)
bigArr = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]] ## Your list of pairs of values
molArr = []
for pair in bigArr:
    flag = False
    for molecule in molArr:
        if pair[0] in molecule or pair[1] in molecule: ## Add both values if any of them are in the molecules list
            molecule.append(pair[0])
            molecule.append(pair[1])
            flag = True ## The values have been added to an existing list
    if not flag: ## The values weren't in an existing list so add them both
        molArr.append(list(pair)) ## Copy the pair so bigArr itself is not modified later
for i in range(len(molArr)): ## Remove duplicates in one loop
    molArr[i] = list(set(molArr[i]))
Answer 2 (score: 0)
Here is another approach:
a = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
result = []
for sub in a:
    join = False
    for i, r in enumerate(result):
        if any([x in r for x in sub]):
            join = True
            index = i
    if join:
        result[index] += [y for y in sub if y not in result[index]]
    else:
        result.append(sub)
result
#[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]
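One case worth checking with the loop-based answers is a bond that links two groups that have already been started separately, for example [[1,2],[3,4],[2,3]]: the pair is added to an existing group, but the two groups are never merged into one. A minimal sketch of a variant that merges every group a bond touches is shown below. `merge_bonds` is a hypothetical name, and the approach still scans all groups for each bond, so it is likely to be slow on large files:

```python
def merge_bonds(bonds):
    """Group atoms by merging every existing group that a new bond touches."""
    groups = []  # list of disjoint sets of atom labels
    for a, b in bonds:
        merged = {a, b}
        untouched = []
        for group in groups:
            if a in group or b in group:
                merged |= group          # absorb the group this bond touches
            else:
                untouched.append(group)  # leave unrelated groups alone
        groups = untouched + [merged]
    return sorted(sorted(g) for g in groups)

print(merge_bonds([[1, 2], [3, 4], [2, 3]]))
# [[1, 2, 3, 4]]
```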