我具有以下格式的数据框:
+----+-----+
| features |
+----+-----+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----+----+
用于创建示例数据框的脚本:
rows2 = sc.parallelize([ IndexedRow(0, [1, 4, 7, 10 ]),
IndexedRow(1, [2, 5, 8, 1]),
IndexedRow(1, [3, 6, 9, 12]),
])
rows_df = rows2.toDF()
row_vec= rows_df.drop("index")
row_vec.show()
功能部件列包含4个功能部件,并且有3行ID。我想将此数据转换为rowmatrix,其中的行和列将采用以下mat格式:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
基本上,我想将数据框转置为新格式,以便可以运行columnSimilarities()函数。我有一个更大的数据框,其中包含50个要素和39000行。
答案 0 :(得分:0)
这是您要尝试做的吗?讨厌使用collect()
,但是您想在这里避免这种情况,因为您想将结构化对象整形/转换为矩阵...对吧?
X = np.array(row_vec.select("_2").collect()).reshape(-1,3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
答案 1 :(得分:0)
我知道了,我使用了以下内容:
from pyspark.mllib.linalg.distributed import RowMatrix
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd )
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coordmatrix_features = CoordinateMatrix(
features_mat .rows.zipWithIndex().flatMap(
lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
)
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()