Question

I'm using MySQLdb with Python. I have some basic queries, such as:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

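I need "results" as a NumPy array, and I'd like to be economical with memory. Copying the data row by row seems very inefficient (it needs twice the memory). Is there a better way to convert MySQLdb query results into a NumPy array?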

The reason I want a NumPy array is that I'd like to slice and dice the data easily, and NumPy arrays make that easy for multidimensional data in Python.

e.g. b = a[a[:,2]==1]
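
To make the expression above concrete, here is a minimal sketch of that kind of boolean-mask row selection. The three-column array and its values are made up for illustration; they do not come from the video table.

```python
import numpy as np

# Hypothetical data: each row is (id, rating, flag).
a = np.array([[10, 4, 1],
              [11, 2, 0],
              [12, 5, 1]])

mask = a[:, 2] == 1   # boolean array with one entry per row
b = a[mask]           # keeps only the rows whose third column equals 1
```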

Thanks!

Answer 1

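This solution uses Kieth's fromiter technique, but handles the two-dimensional table structure of SQL results more intuitively. It also improves on Doug's method by avoiding all the reshaping and flattening in Python data types. Using a structured array, we can read almost directly from the MySQL result into numpy, cutting out Python data types almost entirely. I say "almost" because the fetchall iterator still produces Python tuples.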

There is one caveat, though it isn't a big one: you must know the data types of your columns and the number of rows in advance.

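Knowing the column types should be obvious, since you presumably know what the query is; otherwise you can always use curs.description together with a map of the MySQLdb.FIELD_TYPE.* constants.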
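
As a rough sketch of that curs.description approach, the helper below builds a structured dtype from DB-API description tuples. The helper name, the mapping table, and the hand-written description are all assumptions for illustration. The integer codes are meant to mirror MySQLdb.constants.FIELD_TYPE (TINY=1, SHORT=2, LONG=3, FLOAT=4, DOUBLE=5, LONGLONG=8); in real code you would probably reference those constants directly. It does not handle unsigned columns or NULL values.

```python
import numpy as np

# Assumption: these integer codes mirror MySQLdb.constants.FIELD_TYPE
# (TINY=1, SHORT=2, LONG=3, FLOAT=4, DOUBLE=5, LONGLONG=8).
FIELD_TYPE_TO_NUMPY = {
    1: 'i1',
    2: 'i2',
    3: 'i4',
    4: 'f4',
    5: 'f8',
    8: 'i8',
}

def dtype_from_description(description):
    """Build a structured dtype from DB-API cursor.description entries."""
    fields = []
    for column in description:
        # Each entry starts with (name, type_code, ...)
        name, type_code = column[0], column[1]
        fields.append((name, FIELD_TYPE_TO_NUMPY[type_code]))
    return np.dtype(fields)

# Hand-written stand-in for curs.description after
# "SELECT id, rating FROM video" (both columns assumed to be INT).
fake_description = (
    ('id', 3, 11, 11, 11, 0, 0),
    ('rating', 3, 11, 11, 11, 0, 1),
)
dt = dtype_from_description(fake_description)
```

A type code missing from the mapping raises KeyError, which seems preferable to silently guessing a width.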

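Knowing the number of rows means you have to use a client-side cursor (which is the default). I don't know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that with a client-side cursor the entire result is fetched into client memory, although I suspect there is actually some buffering and caching involved. That would mean using double the memory for the result, once for the cursor's copy and once for the array's copy, so if the result set is large it is probably a good idea to close the cursor as soon as possible to free the memory.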
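
One way to close the cursor promptly is to wrap the conversion in contextlib.closing, as in the sketch below. FakeCursor and cursor_to_array are hypothetical names used only so the example runs without a database; a real MySQLdb cursor would take the fake's place.

```python
from contextlib import closing

import numpy as np

class FakeCursor:
    """Hypothetical stand-in for a MySQLdb cursor, for illustration only."""

    def __init__(self, rows):
        self._rows = rows
        self.rowcount = len(rows)
        self.closed = False

    def fetchall(self):
        return self._rows

    def close(self):
        self.closed = True
        self._rows = None

def cursor_to_array(cursor, dtype):
    # closing() calls cursor.close() as soon as the array has been built,
    # even if fromiter raises.
    with closing(cursor) as curs:
        return np.fromiter(curs.fetchall(), dtype=dtype, count=curs.rowcount)

cur = FakeCursor(((1, 5), (2, 3)))
A = cursor_to_array(cur, 'i4,i4')
```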

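Strictly speaking, you don't have to provide the number of rows in advance, but doing so means the array's memory is allocated once up front instead of being repeatedly resized as more rows come in from the iterator. That should give a considerable performance boost.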
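
The difference between passing count and omitting it can be sketched without a database. fake_rows is a made-up generator standing in for the cursor's rows. Both calls produce the same array; only the allocation strategy differs.

```python
import numpy as np

def fake_rows(n):
    """Stand-in for a cursor's row iterator: yields (id, rating) tuples."""
    for i in range(n):
        yield (i, i % 5)

n = 1000

# count=n lets NumPy allocate the whole array once, up front.
preallocated = np.fromiter(fake_rows(n), dtype='i4,i4', count=n)

# count=-1 (the default) consumes the iterator to exhaustion,
# growing the array as needed.
grown = np.fromiter(fake_rows(n), dtype='i4,i4')
```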

With that, some code:

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor()  # Use a client-side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

# curs.fetchall() is the iterator as per Kieth's answer
# count=numrows means advance allocation
# dtype='i4,i4' means two columns, both 4-byte (32-bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print(A)  # output entire array
ids = A['f0']  # ids = an array of the first column
               # (strictly speaking it's a field, not a column)
ratings = A['f1']  # ratings is an array of the second column

For details on how to specify the column data types and column names, see the numpy documentation on dtype and on structured arrays.
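
As a small illustration of naming the fields rather than relying on the default 'f0' and 'f1', here is a sketch with made-up rows shaped like the tuples fetchall() produces. The field names 'id' and 'rating' are chosen to match the query.

```python
import numpy as np

# Hypothetical rows, shaped like the tuples fetchall() yields.
rows = ((1, 5), (2, 3), (3, 5))

video_dtype = np.dtype([('id', 'i4'), ('rating', 'i4')])
A = np.fromiter(rows, dtype=video_dtype, count=len(rows))

# ids of the rows whose rating is 5
top_ids = A['id'][A['rating'] == 5]
```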

Answer 2

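The fetchall method returns an iterable sequence of rows, and numpy has the fromiter method to initialize an array from an iterator. So, depending on what data is in the table, you could easily combine the two, or use an adapter generator.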
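
A possible shape for such an adapter generator is sketched below. flatten_rows and the sample rows are hypothetical, and the sketch assumes every column can be represented as a float.

```python
import numpy as np

def flatten_rows(rows):
    """Adapter generator: yield each value of each row, in row-major order."""
    for row in rows:
        for value in row:
            yield value

rows = ((1, 4.5), (2, 3.0), (3, 5.0))  # hypothetical fetchall() result

flat = np.fromiter(flatten_rows(rows), dtype=float)
table = flat.reshape(len(rows), -1)    # back to one array row per table row
```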

Answer 3

NumPy's fromiter method seems best here (as in Keith's answer, which preceded this one).

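Using fromiter to recast a result set, returned by a call to a MySQLdb cursor method, into a NumPy array is simple, but there are a couple of details perhaps worth mentioning.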

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, rating from video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = c.fetchall()

# 'num_rows' is needed to reshape the 1D NumPy array returned by 'fromiter',
# in other words, to restore the original dimensions of the result set
num_rows = int(c.rowcount)

# recast this nested tuple to a python list and flatten it so it's a proper iterable:
x = map(list, list(results))              # change the type
x = sum(x, [])                            # flatten

# D is a 1D NumPy array
D = NP.fromiter(x, dtype=float, count=-1)  # pass the iterable positionally

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

Note that fromiter returns a 1D NumPy array. (This makes sense, of course, because you can use fromiter to return just a portion of a single MySQL table row, by passing a value for count.)

Still, you have to restore the 2D shape, hence the earlier read of the cursor's rowcount and the subsequent call to reshape in the final line.

Finally, the default value of the count parameter is -1, which simply retrieves the entire iterable.
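
As a variation on the flattening step, the sketch below swaps in itertools.chain.from_iterable in place of the list conversion and sum(x, []), so no intermediate Python lists are built. This is not the code from the answer. The sample rows are made up, and the column count is taken from the first row.

```python
import itertools

import numpy as np

results = ((1, 4.5), (2, 3.0), (3, 5.0))  # hypothetical fetchall() result
num_rows = len(results)
num_cols = len(results[0])

# Lazily walk every value of every row, without building intermediate lists.
flat = itertools.chain.from_iterable(results)

# The total element count is known, so fromiter can allocate once.
D = NP = None  # placeholder removed below; see next lines
D = np.fromiter(flat, dtype=float, count=num_rows * num_cols)
D = D.reshape(num_rows, num_cols)
```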
