Python: denormalize a dataset into a matrix-like view

Time: 2014-12-15 15:07:46

Tags: python matrix normalization denormalization

I have a normalized (in the database sense) dataset with 3 columns (~5000 rows), for example:

user        phrase       tfw
517187571   able         1
517187571   abroad       0.4
1037767202  abuse        0.272727
517187571   accuse       0.8
803230586   acknowledge  0.4
...

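I need to convert it into a matrix-like view in which the rows are users, the columns are phrases, and each cell holds the tfw value for that row/column pair. Does anyone have a smart idea for doing this efficiently in Python? The desired output (for the example above) would be: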

user/phrase   able   abroad   abuse    accuse   acknowledge
517187571     1      0        0        0        0
517187571     0      0.4      0        0        0
1037767202    0      0        0.272727 0        0
...

I tried doing this with an SQL query against the MySQL DB and came up with this genius query, which doesn't work:

SELECT
CONCAT('SELECT user,',
GROUP_CONCAT(sums),
' FROM clustering_normalized_dataset GROUP BY user')
FROM (
 SELECT CONCAT('SUM(phrase=\'', phrase, '\') AS `', phrase, '`') sums
 FROM clustering_normalized_dataset
 GROUP BY phrase
 ORDER BY COUNT(*) DESC
 ) s
INTO @sql;

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

2 answers:

Answer 0: (score: 0)

With the pandas library, this is a one-liner using a simple pivot.

data = [
[517187571,   "able",1],
[517187571,   "abroad",  0.4],
[1037767202,  "abuse",   0.272727],
[517187571,   "accuse",  0.8],
[803230586,   "acknowledge", 0.4]]

import pandas as pd
df = pd.DataFrame(data,columns=("user","phrase","tfw"))
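# Keyword arguments keep this working on current pandas, where pivot no longer accepts positional arguments
print(df.pivot(index="user", columns="phrase", values="tfw"))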

This gives

phrase      able  abroad     abuse  accuse  acknowledge
user                                                   
517187571      1     0.4       NaN     0.8          NaN
803230586    NaN     NaN       NaN     NaN          0.4
1037767202   NaN     NaN  0.272727     NaN          NaN

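Replacing the NaNs with 0.0 is trivial, but sometimes it is nice to leave them in to indicate that you have no data for that entry. In any case, you can always sum over the valid range. The huge advantage over other methods, such as the one you are proposing, is that the extra data isn't stored in memory.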
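As a minimal sketch of the two options mentioned above (filling the NaNs versus leaving them in and summing only the valid entries), the following reuses the sample rows from the question. The variable names `wide`, `filled`, and `row_totals` are made up for illustration.

```python
import pandas as pd

# Sample rows from the question: (user, phrase, tfw)
rows = [
    (517187571, "able", 1.0),
    (517187571, "abroad", 0.4),
    (1037767202, "abuse", 0.272727),
    (517187571, "accuse", 0.8),
    (803230586, "acknowledge", 0.4),
]
df = pd.DataFrame(rows, columns=["user", "phrase", "tfw"])

# Users become rows, phrases become columns; missing pairs show up as NaN
wide = df.pivot(index="user", columns="phrase", values="tfw")

# Option 1: replace the NaNs with 0.0
filled = wide.fillna(0.0)

# Option 2: keep the NaNs; sum() skips them by default
row_totals = wide.sum(axis=1)

print(filled)
print(row_totals)
```

Note that `pivot` raises an error if the same (user, phrase) pair occurs more than once, so this assumes each pair is unique in the source table.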

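Answer 1: (score: 0)

5000 rows is really not that much data. You want an NxM matrix, where N is the number of distinct users and M is the number of distinct phrases.

This is a bit brute force, but I would probably build the matrix filled with 0s and then scan the master list to insert all the data you have.

Let's assume you have just pulled all the raw data from the db into Python: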

raw = [
    [517187571, 'able', 1],
    [517187571, 'abroad', 0.4],
    [1037767202, 'abuse', 0.272727],
    [517187571, 'accuse', 0.8],
    [803230586, 'acknowledge', .4],
    # ... (remaining rows)
]

# find our row / column titles
users = sorted(set(r[0] for r in raw))
words = sorted(set(r[1] for r in raw))

# indexes so we can see which position in the matrix belongs to a given word / user
user_to_pos = {u:i for i, u in enumerate(users)}
word_to_pos = {u:i for i, u in enumerate(words)}

# make the empty matrix
matrix = []
for u in users:
    matrix.append([0] * len(words))

for user, word, tfw in raw:
    matrix[user_to_pos[user]][word_to_pos[word]] = tfw

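If you use numpy you can build that matrix faster, and if you use pandas you can have it handle the column names for you (depending on what you do afterwards, those libraries are worth looking into).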
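A rough sketch of the numpy variant, assuming the same `raw` list layout as above: the matrix is allocated with `np.zeros` and filled in one indexed assignment instead of a Python loop. The helper lists `row_idx`, `col_idx`, and `values` are illustrative names, not part of the original answer.

```python
import numpy as np

# Sample rows from the question: [user, phrase, tfw]
raw = [
    [517187571, "able", 1],
    [517187571, "abroad", 0.4],
    [1037767202, "abuse", 0.272727],
    [517187571, "accuse", 0.8],
    [803230586, "acknowledge", 0.4],
]

# Row / column titles, sorted so positions are reproducible
users = sorted({r[0] for r in raw})
words = sorted({r[1] for r in raw})
user_to_pos = {u: i for i, u in enumerate(users)}
word_to_pos = {w: i for i, w in enumerate(words)}

# Zero-filled N x M matrix (N users, M phrases)
matrix = np.zeros((len(users), len(words)))

# Translate every record to a (row, column) position, then assign all values at once
row_idx = [user_to_pos[u] for u, _, _ in raw]
col_idx = [word_to_pos[w] for _, w, _ in raw]
values = [tfw for _, _, tfw in raw]
matrix[row_idx, col_idx] = values

print(users)
print(words)
print(matrix)
```

If a (user, phrase) pair can appear more than once, this assignment keeps only one of the values; summing duplicates would need something like `np.add.at(matrix, (row_idx, col_idx), values)` instead.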