Question

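I have a list of 2215 molecules encoded as 2048-bit vectors, and I am trying to build a 2D array from them. I am using the rdkit library to convert them to numpy arrays. The code worked fine a few weeks ago, but now it raises a memory error and I don't know why. Can anyone offer a solution?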

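I tried shrinking the list, cutting it down to just two vectors. I thought that would help, but after some processing time the error still pops up. That leads me to believe I actually do have enough memory.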

# red_fp is the list of bit vectors

def rdkit_numpy_convert(red_fp):
    output = []
    for f in fp:
        arr = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(f, arr)
        output.append(arr)
    return np.asarray(output)

# this one line causes the problem
x = rdkit_numpy_convert(red_fp)

Here is the error:

MemoryError  Traceback (most recent call last)
MemoryError: cannot allocate memory for array

The above exception was the direct cause of the following exception:

SystemError  Traceback (most recent call last)
<ipython-input-14-91594513666c> in <module>
----> 1 x = rdkit_numpy_convert(red_fp)

<ipython-input-13-78d1c9fdd07e> in rdkit_numpy_convert(red_fp)
      4     for f in fp:
      5         arr = np.zeros((1,))
----> 6         DataStructs.ConvertToNumpyArray(f, arr)
      7         output.append(arr)
      8     return np.asarray(output)

SystemError: <Boost.Python.function object at 0x55a2a5743520> returned a result with an error set

Answer 1

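I believe your problem is that the fingerprints you are using are not compatible with this method of converting to a numpy array.

I'm not sure which type of fingerprint you are using, but assuming it is a Morgan fingerprint, I ran some quick experiments: the conversion appears to hang when the fingerprint comes from "GetMorganFingerprint" rather than "GetMorganFingerprintAsBitVect". I'm not sure why this happens, but I think it is because the first method produces a UIntSparseIntVect rather than an ExplicitBitVect. That said, the same conversion works fine on fingerprints produced by "GetHashedMorganFingerprint", which also returns a UIntSparseIntVect.

If you are using Morgan fingerprints, I suggest trying the "GetMorganFingerprintAsBitVect" method.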
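A minimal sketch of that suggestion, assuming Morgan fingerprints of radius 2 folded to 2048 bits. The two SMILES strings and the helper name fps_to_array are made up for illustration. Note that the loop runs over the function's own argument; the function in the question iterates over a name fp that is not its parameter red_fp, which may be worth checking as well.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def fps_to_array(bit_vects):
    """Stack ExplicitBitVect fingerprints into a 2D numpy array."""
    rows = []
    for bv in bit_vects:  # iterate over the argument, not a global name
        # Zero-length buffer; ConvertToNumpyArray is expected to resize it.
        arr = np.zeros((0,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(bv, arr)
        rows.append(arr)
    return np.asarray(rows)


# Illustrative molecules; any valid SMILES would do.
mols = [Chem.MolFromSmiles(s) for s in ("c1ccccc1", "CCO")]
# Folded bit-vector fingerprints (radius 2, 2048 bits).
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

x = fps_to_array(fps)
print(x.shape, x.dtype)
```

With 2215 molecules the result should have one row per molecule and 2048 columns.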

Edit:

I ran a few experiments:

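from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('c1ccccc1')

# Unhashed Morgan fingerprint (sparse vector)
fp = AllChem.GetMorganFingerprint(mol, 2)
print(fp.GetLength())  # 4294967295

# Bit-vector Morgan fingerprint
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
print(fp1.GetNumBits())  # 2048

# Hashed Morgan fingerprint
fp2 = AllChem.GetHashedMorganFingerprint(mol, 2)
print(fp2.GetLength())  # 2048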

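As you can see, the fingerprint from the first method is huge. My initial thought is that this fingerprint is in its unfolded form and therefore uses a sparse data structure, which would explain why you run into trouble when trying to allocate memory for a fingerprint of this size.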
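To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the conversion needs one dense array element per fingerprint position and that the destination keeps the float64 dtype of np.zeros((1,)); both are assumptions about how the conversion behaves, not something confirmed here. The helper dense_bytes is hypothetical.

```python
BYTES_PER_FLOAT64 = 8  # np.zeros((1,)) defaults to float64


def dense_bytes(length, itemsize=BYTES_PER_FLOAT64):
    """Bytes needed for a dense array with one element per position."""
    return length * itemsize


unfolded = dense_bytes(4294967295)  # length reported for GetMorganFingerprint
folded = dense_bytes(2048)          # length of the 2048-bit fingerprint

print(f"unfolded: {unfolded / 2**30:.1f} GiB per fingerprint")
print(f"folded:   {folded / 2**10:.1f} KiB per fingerprint")
print(f"2215 folded fingerprints: {2215 * folded / 2**20:.1f} MiB")
```

Under those assumptions a single unfolded fingerprint would need about 32 GiB, which would also explain why cutting the list down to two vectors did not help.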

Answer 2

This is the first time I've heard of rdkit, but it appears to be C++ code wrapped with Boost.

From the documentation, https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html

the second argument of ConvertToNumpyArray is destArray.

rdkit.DataStructs.cDataStructs.ConvertToNumpyArray((ExplicitBitVect)bv,
    (AtomPairsParameters)destArray) → None

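My guess is that the function tries to put the converted values into destArray. It does not try to allocate new memory itself (as a conventional numpy constructor would), but only fills the array it is given.

If that guess is correct, the error is in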

arr = np.zeros((1,))

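arr only has room for one float (8 bytes). arr must be large enough (and of the right dtype) to hold whatever the conversion produces.

Is there any documentation or example that shows how this conversion is used? When asking about a low-traffic tag like [rdkit], it helps if you include some links to documentation and example code.

I glanced at some other [rdkit] questions.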

How can I compute a Count Morgan fingerprint as numpy.array?

suggests that I'm wrong. The accepted answer uses

np.zeros((0,), dtype=np.int8)

which allocates 0 bytes for its data buffer.

Another one uses np.zeros((1,)):

ValueError when doing validation with random forests
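Those two examples are consistent with the destination array being resized in place rather than merely filled. The sketch below is not rdkit code; it only uses numpy's own ndarray.resize to show that a zero-length buffer can be grown to 2048 elements while keeping its dtype, which is what ConvertToNumpyArray is assumed to do. The helper fill_in_place is hypothetical.

```python
import numpy as np


def fill_in_place(dest, values):
    """Hypothetical stand-in for a routine that resizes and fills dest."""
    # refcheck=False skips numpy's reference-count check, so the resize
    # also works in interactive sessions that hold extra references.
    dest.resize((len(values),), refcheck=False)
    dest[:] = values


arr = np.zeros((0,), dtype=np.int8)  # zero-length data buffer
fill_in_place(arr, [1, 0] * 1024)    # 2048 illustrative bit values
print(arr.shape, arr.dtype, arr.nbytes)
```

If the real function behaves this way, the initial size of arr would not matter, and only its dtype would carry over to the result.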

Cannot allocate memory for array: rdkit conversion to numpy array error