Efficient way to create count vectors from MySQL in Python

Time: 2013-08-04 11:41:53

Tags: python mysql

I have this data in MySQL (sample):

Table 1:

ID     ITEM    CNT
--------------------
0001    AAB     5
0001    BBA     3
0001    BBB     8
0001    AAC     10
0002    BBA     2
0002    BBC     7
0003    FFG     2
0003    JPO     4
0003    PUI     22
..........

I would like to find a way to import this data into Python as count vectors, for example:

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

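where the elements are the counts of each item for that ID, in this order: [AAB,AAC,BBA,BBB,BBC,FFG,JPO,PUI]

So my question is: what is the best and most efficient way to do this? Is it better to do it in Python or in MySQL?

Thanks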

2 answers:

Answer 0: (score: 1)

When possible, it is usually more efficient to manipulate the data in SQL rather than in Python.

With this setup:

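import config  # local module holding the connection settings (HOST, USER, PASS)
import MySQLdb

# Connect to the 'test' database and open a cursor for the statements below.
conn = MySQLdb.connect(
    host=config.HOST, user=config.USER,
    passwd=config.PASS, db='test')
cursor = conn.cursor()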

sql = '''\
DROP TABLE IF EXISTS foo 
'''
cursor.execute(sql)

sql = '''\
CREATE TABLE foo (
    ID varchar(4),
    ITEM varchar(3),
    CNT int)
'''

cursor.execute(sql)

sql = '''\
INSERT INTO foo VALUES (%s,%s,%s)
'''

cursor.executemany(sql, [['0001', 'AAB', 5],
                         ['0001', 'BBA', 3],
                         ['0001', 'BBB', 8],
                         ['0002', 'BBA', 2]])

you can build the required SQL like this:

items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
fields = ', '.join('COALESCE({}.CNT, 0)'.format(item) for item in items)
joins = '\n'.join('''\
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = '{i}') as {i}
    ON T.ID = {i}.ID'''.format(i=item) for item in items)
sql = '''\
SELECT T.ID, {f}
FROM (SELECT DISTINCT ID from foo) as T
{j}
'''.format(f=fields, j=joins)

print(sql)

and use it like this:

result = dict()
cursor.execute(sql)
for row in cursor:
    result[row[0]] = row[1:]
print(result)

The SQL query used is:

SELECT T.ID, COALESCE(AAB.CNT, 0), COALESCE(AAC.CNT, 0), COALESCE(BBA.CNT, 0), COALESCE(BBB.CNT, 0), COALESCE(BBC.CNT, 0), COALESCE(FFG.CNT, 0), COALESCE(JPO.CNT, 0), COALESCE(PUI.CNT, 0)
FROM (SELECT DISTINCT ID from foo) as T
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAB') as AAB
    ON T.ID = AAB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAC') as AAC
    ON T.ID = AAC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBA') as BBA
    ON T.ID = BBA.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBB') as BBB
    ON T.ID = BBB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBC') as BBC
    ON T.ID = BBC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'FFG') as FFG
    ON T.ID = FFG.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'JPO') as JPO
    ON T.ID = JPO.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'PUI') as PUI
    ON T.ID = PUI.ID

The resulting dict looks like:

{'0001': (5L, 0L, 3L, 8L, 0L, 0L, 0L, 0L), '0002': (0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)}
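The values come back as tuples, while the question asked for lists. A minimal sketch of the conversion follows; `rows` is a hypothetical stand-in for what the cursor yields with the four rows inserted above, so it runs without a database.

```python
# Hypothetical stand-in for the rows produced by iterating over the cursor.
rows = [
    ('0001', 5, 0, 3, 8, 0, 0, 0, 0),
    ('0002', 0, 0, 2, 0, 0, 0, 0, 0),
]

# Key each vector by ID and turn the remaining columns into a list of ints.
result = {row[0]: [int(c) for c in row[1:]] for row in rows}
print(result['0001'])
```

With a live cursor, the same comprehension should work with `cursor` in place of `rows`.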

I know you asked for

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

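but there are at least two problems with that. First, 0001 is not a valid Python variable name; variable names cannot start with a digit. Second, you do not want dynamically defined variable names, because it is hard to program with a bare variable name that is not known until runtime.

Instead, use the would-be variable names as keys in a dict, result. You can then refer to the "variable" 0001 as result['0001'].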
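For comparison, here is a minimal sketch of doing the pivot on the Python side instead, from plain (ID, ITEM, CNT) rows such as a `SELECT ID, ITEM, CNT FROM foo` would return. The helper name `build_count_vectors` and the hard-coded `rows` list are assumptions for illustration; the rows are copied from the question's sample table.

```python
def build_count_vectors(rows, items):
    """Return {ID: [count per item]} with counts in the order given by items."""
    index = {item: pos for pos, item in enumerate(items)}
    vectors = {}
    for row_id, item, cnt in rows:
        # Start each new ID with an all-zero vector, then fill in its counts.
        vec = vectors.setdefault(row_id, [0] * len(items))
        vec[index[item]] += cnt
    return vectors


items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()

# Sample rows from the question, standing in for cursor.fetchall().
rows = [
    ('0001', 'AAB', 5), ('0001', 'BBA', 3), ('0001', 'BBB', 8),
    ('0001', 'AAC', 10), ('0002', 'BBA', 2), ('0002', 'BBC', 7),
    ('0003', 'FFG', 2), ('0003', 'JPO', 4), ('0003', 'PUI', 22),
]

vectors = build_count_vectors(rows, items)
print(vectors['0001'])
```

This makes a single pass over the rows and needs no generated SQL, at the cost of transferring every row to Python. Note that with the sample table, 0003 gets a 2 in the FFG position, because the table lists FFG with a count of 2 for that ID.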

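Answer 1: (score: 0)

You could do this with a crosstab query, where the row headings are the IDs, the column headings are the items, and CNT is the value to aggregate. You can then loop over the columns of each row to get the vector. For help with crosstab queries, see: http://allenbrowne.com/ser-67.html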
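One common way to write such a crosstab in plain SQL is conditional aggregation: one SUM(CASE ...) column per item, grouped by ID. The sketch below is an assumption about how that could look, not the query from the linked page. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server; the table name foo and the hard-coded sample rows are likewise for illustration only.

```python
import sqlite3

items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
rows = [
    ('0001', 'AAB', 5), ('0001', 'BBA', 3), ('0001', 'BBB', 8),
    ('0001', 'AAC', 10), ('0002', 'BBA', 2), ('0002', 'BBC', 7),
    ('0003', 'FFG', 2), ('0003', 'JPO', 4), ('0003', 'PUI', 22),
]

# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE foo (ID varchar(4), ITEM varchar(3), CNT int)')
conn.executemany('INSERT INTO foo VALUES (?, ?, ?)', rows)

# One aggregated column per item. The item names are interpolated into the
# SQL text, so they must come from a trusted, fixed list as they do here.
columns = ', '.join(
    "SUM(CASE WHEN ITEM = '{i}' THEN CNT ELSE 0 END) AS {i}".format(i=item)
    for item in items)
sql = 'SELECT ID, {c} FROM foo GROUP BY ID ORDER BY ID'.format(c=columns)

# Loop over each row: the first column is the ID, the rest form the vector.
crosstab = {row[0]: list(row[1:]) for row in conn.execute(sql)}
print(crosstab['0002'])
```

The generated SELECT uses only standard SUM, CASE and GROUP BY, so it should carry over to MySQL with the connection and the `?` placeholders swapped for MySQLdb's `%s` style. It scans the table once rather than joining one subquery per item.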