I have a normalized (in the DB sense) dataset with 3 columns (~5000 rows), for example:
user phrase tfw
517187571 able 1
517187571 abroad 0.4
1037767202 abuse 0.272727
517187571 accuse 0.8
803230586 acknowledge 0.4
...
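I need to convert it into a matrix-like view, where the rows are users, the columns are phrases, and each cell holds the tfw value at the corresponding row/column index. Does anyone have a smart idea for doing this efficiently in Python? The desired output would be (for the example above):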
user/phrase able abroad abuse accuse acknowledge
517187571 1 0 0 0 0
517187571 0 0.4 0 0 0
1037767202 0 0 0.272727 0 0
...
I tried doing this in an SQL query against the MySQL DB and came up with this genius query, which doesn't work:
SELECT
CONCAT('SELECT user,',
GROUP_CONCAT(sums),
' FROM clustering_normalized_dataset GROUP BY user')
FROM (
SELECT CONCAT('SUM(phrase=\'', phrase, '\') AS `', phrase, '`') sums
FROM clustering_normalized_dataset
GROUP BY phrase
ORDER BY COUNT(*) DESC
) s
INTO @sql;
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
Answer 0 (score: 0)
Using the pandas library, this is a one-liner with a simple pivot.
data = [
[517187571, "able",1],
[517187571, "abroad", 0.4],
[1037767202, "abuse", 0.272727],
[517187571, "accuse", 0.8],
[803230586, "acknowledge", 0.4]]
import pandas as pd
df = pd.DataFrame(data,columns=("user","phrase","tfw"))
print(df.pivot(index="user", columns="phrase", values="tfw"))
This gives
phrase able abroad abuse accuse acknowledge
user
517187571 1 0.4 NaN 0.8 NaN
803230586 NaN NaN NaN NaN 0.4
1037767202 NaN NaN 0.272727 NaN NaN
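Replacing the NaNs with 0.0 is trivial, but sometimes it is nice to leave them in to indicate that you have no data for that entry. Either way, you can always sum over the valid range. The huge advantage over other approaches, such as the one you suggested, is that the extra data is not stored in memory.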
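As a minimal sketch of the NaN replacement mentioned above, assuming the same five sample rows and a pandas version that accepts keyword arguments to pivot, fillna(0.0) fills the missing cells. The variable name matrix is only illustrative:

```python
import pandas as pd

data = [
    [517187571, "able", 1],
    [517187571, "abroad", 0.4],
    [1037767202, "abuse", 0.272727],
    [517187571, "accuse", 0.8],
    [803230586, "acknowledge", 0.4],
]
df = pd.DataFrame(data, columns=("user", "phrase", "tfw"))

# Pivot to a user x phrase table, then replace the missing cells with 0.0
matrix = df.pivot(index="user", columns="phrase", values="tfw").fillna(0.0)
print(matrix)
```

Note that pivot raises an error if the same (user, phrase) pair occurs more than once, so this assumes each pair is unique in the dataset.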
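Answer 1 (score: 0)
5000 rows isn't really that much data. You want an NxM matrix, where N is the number of distinct users and M is the number of distinct phrases.
This is a bit brute force, but I would probably build the matrix filled with 0s and then scan the main list to insert all the extra data you have.
Let's assume you have just pulled all the raw data from the db into python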
raw = [
[517187571, 'able', 1],
[517187571, 'abroad', 0.4],
[1037767202, 'abuse', 0.272727],
[517187571, 'accuse', 0.8],
[803230586, 'acknowledge', .4],
# ...
]
# find our row / column titles
users = sorted(set(r[0] for r in raw))
words = sorted(set(r[1] for r in raw))
# indexes so we can see which position in the matrix belongs to a given word / user
user_to_pos = {u:i for i, u in enumerate(users)}
word_to_pos = {u:i for i, u in enumerate(words)}
# make the empty matrix
matrix = []
for u in users:
    matrix.append([0] * len(words))
for user, word, tfw in raw:
    matrix[user_to_pos[user]][word_to_pos[word]] = tfw
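If you use numpy, you can build that matrix faster, and if you use pandas, you can have it handle the column names for you (depending on what you do afterwards, those libraries are worth looking into).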
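As a rough sketch of the numpy variant, assuming the same five sample rows, the zero-filled matrix can be allocated in one call and filled with a single indexed assignment. The names rows and cols are introduced here only for illustration:

```python
import numpy as np

raw = [
    [517187571, 'able', 1],
    [517187571, 'abroad', 0.4],
    [1037767202, 'abuse', 0.272727],
    [517187571, 'accuse', 0.8],
    [803230586, 'acknowledge', .4],
]

# row / column titles and their positions in the matrix
users = sorted(set(r[0] for r in raw))
words = sorted(set(r[1] for r in raw))
user_to_pos = {u: i for i, u in enumerate(users)}
word_to_pos = {w: i for i, w in enumerate(words)}

# allocate the zero-filled matrix in one step
matrix = np.zeros((len(users), len(words)))

# translate each record into (row, column) positions and assign all values at once
rows = [user_to_pos[r[0]] for r in raw]
cols = [word_to_pos[r[1]] for r in raw]
matrix[rows, cols] = [r[2] for r in raw]
print(matrix)
```

If a (user, word) pair appears more than once, the later value overwrites the earlier one, just as in the plain-list loop.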