I need to generate an output table for a subset of the MovieLens ratings data. I have converted my DataFrame to a CoordinateMatrix:
from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix
mat = CoordinateMatrix(ratings.map(
    lambda r: MatrixEntry(r.user, r.product, r.rating)))
However, I can't see how to print the output in a tabular format. I can print the entries:
mat.entries.collect()
which outputs:
[MatrixEntry(1, 1, 5.0),
MatrixEntry(5, 6, 2.0),
MatrixEntry(6, 1, 4.0),
MatrixEntry(7, 6, 4.0),
MatrixEntry(8, 1, 4.0),
MatrixEntry(8, 4, 3.0),
MatrixEntry(9, 1, 5.0)]
However, I would like the output to look like this:
     1  2  3  4  5  6  7  8  9
   ------------------------------------- ...
1 |  5
2 |
3 |
4 |
5 |                 2
...
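For a small subset like this one, one possible way to get such a layout is to collect the entries to the driver and format them in plain Python. The following is only a minimal sketch under that assumption (everything fits in memory): `render_table` is a hypothetical helper, not part of Spark, and the plain tuples stand in for the `i`, `j`, and `value` fields of the collected MatrixEntry objects.

```python
def render_table(entries, n_rows, n_cols, width=3):
    """Format sparse (row, col, value) triples as a 1-based text grid."""
    # Index the sparse entries by (row, col) for constant-time lookup.
    cells = {(i, j): v for i, j, v in entries}
    gutter = len(str(n_rows))      # width of the row-label column
    pad = " " * (gutter + 2)       # room for the row label plus " |"
    header = pad + "".join(str(j).rjust(width) for j in range(1, n_cols + 1))
    lines = [header, pad + "-" * (width * n_cols)]
    for i in range(1, n_rows + 1):
        # Missing ratings are rendered as blanks rather than zeros.
        row = "".join(
            (format(cells[(i, j)], "g") if (i, j) in cells else "").rjust(width)
            for j in range(1, n_cols + 1)
        )
        lines.append((str(i).rjust(gutter) + " |" + row).rstrip())
    return "\n".join(lines)


# Triples copied from the collected output above; with Spark they could be
# built as [(e.i, e.j, e.value) for e in mat.entries.collect()].
entries = [(1, 1, 5.0), (5, 6, 2.0), (6, 1, 4.0), (7, 6, 4.0),
           (8, 1, 4.0), (8, 4, 3.0), (9, 1, 5.0)]
table = render_table(entries, n_rows=9, n_cols=9)
print(table)
```

This only makes sense for a slice small enough to read on screen; collecting the full ratings matrix to the driver would defeat the purpose of keeping it distributed.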
Update
The pandas equivalent is pivot_table, for example:
import pandas as pd
import numpy as np
import os
import requests
import zipfile

np.set_printoptions(precision=4)

filename = 'ml-1m.zip'
# Download the archive only if it is not already present locally
if not os.path.exists(filename):
    r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    else:
        raise RuntimeError('Could not save dataset')

# Extract the archive into the current directory
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall('.')

ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')

# Users as rows, movies as columns; cells with no rating are filled with 0
ratingsMatrix = ratings.pivot_table(columns=['movieId'], index=['userId'], values='rating', dropna=False)
ratingsMatrix = ratingsMatrix.fillna(0)

# we don't have space to print the full matrix, just show the first few cells
# (label-based slice: userId 1 to 9, movieId 1 to 9)
print(ratingsMatrix.loc[:9, :9])
which outputs:
movieId    1    2    3    4    5    6    7    8    9
userId
1        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
4        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
5        0.0  0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0
6        4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
7        0.0  0.0  0.0  0.0  0.0  4.0  0.0  0.0  0.0
8        4.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  0.0
9        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0