How can I output a CoordinateMatrix in tabular format?

Time: 2017-03-17 10:49:29

Tags: apache-spark apache-spark-mllib

I need to produce an output table for a subset of the MovieLens ratings data. I have converted my DataFrame to a CoordinateMatrix:

from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix

mat = CoordinateMatrix(ratings.map( 
        lambda r: MatrixEntry(r.user, r.product, r.rating)))

However, I can't see how to print it in tabular form. I can print the entries:

mat.entries.collect()

which outputs:

[MatrixEntry(1, 1, 5.0),
 MatrixEntry(5, 6, 2.0),
 MatrixEntry(6, 1, 4.0),
 MatrixEntry(7, 6, 4.0),
 MatrixEntry(8, 1, 4.0),
 MatrixEntry(8, 4, 3.0),
 MatrixEntry(9, 1, 5.0)]

But I would like output like this:

      1   2   3   4   5   6   7   8   9 
    ------------------------------------- ...
 1  | 5
 2  | 
 3  |
 4  | 
 5  |                     2
    ...
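
For a subset small enough to collect to the driver, one possible approach is to build the grid by hand from the collected entries. The following is only a minimal sketch under that assumption: `Entry` and `format_entries` are hypothetical names introduced here (not part of Spark), `Entry` stands in for `MatrixEntry` so the example runs without a Spark session, and ids are assumed to be 1-based as in MovieLens.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark's MatrixEntry so this runs without Spark;
# real MatrixEntry objects expose the same i, j and value attributes.
Entry = namedtuple("Entry", ["i", "j", "value"])


def format_entries(entries, width=4):
    """Render sparse (i, j, value) entries as a text grid.

    Assumes 1-based ids and a result small enough to hold in memory;
    cells with no entry are left blank.
    """
    cells = {(int(e.i), int(e.j)): e.value for e in entries}
    if not cells:
        return ""
    n_rows = max(i for i, _ in cells)
    n_cols = max(j for _, j in cells)
    # Header row: blank label column, then right-aligned column ids.
    header = " " * width + " |" + "".join(
        str(j).rjust(width) for j in range(1, n_cols + 1))
    lines = [header, "-" * len(header)]
    for i in range(1, n_rows + 1):
        row = "".join(
            (format(cells[(i, j)], "g") if (i, j) in cells else "").rjust(width)
            for j in range(1, n_cols + 1))
        lines.append(str(i).rjust(width) + " |" + row)
    return "\n".join(lines)


# The same entries as in the collected output above.
sample = [Entry(1, 1, 5.0), Entry(5, 6, 2.0), Entry(6, 1, 4.0),
          Entry(7, 6, 4.0), Entry(8, 1, 4.0), Entry(8, 4, 3.0),
          Entry(9, 1, 5.0)]
print(format_entries(sample))
```

With Spark, the same function would presumably be called as `format_entries(mat.entries.collect())`. This does not scale to the full ratings matrix, since everything is pulled onto the driver.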

Update

The pandas equivalent is pivot_table, for example:

import pandas as pd
import numpy as np
import os
import requests
import zipfile

np.set_printoptions(precision=4)

filename = 'ml-1m.zip'
if not os.path.exists(filename):
    r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r:
                f.write(chunk)           
    else:
        raise RuntimeError('Could not save dataset')

with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall('.')

ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')

ratingsMatrix = ratings.pivot_table(columns=['movieId'], index =['userId'], values='rating', dropna = False)

ratingsMatrix = ratingsMatrix.fillna(0)

# we don't have space to print the full matrix, just show the first few cells
print(ratingsMatrix.loc[:9, :9])

which outputs:

movieId    1    2    3    4    5    6    7    8    9
userId                                              
1        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
4        0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
5        0.0  0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0
6        4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
7        0.0  0.0  0.0  0.0  0.0  4.0  0.0  0.0  0.0
8        4.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  0.0
9        5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

0 answers:

No answers