I have a DataFrame with a document id doc_id, a line id line_id for the set of lines in each document, and a dense vector representation vectors of each line. For each document (doc_id), I want to transform the set of vectors representing its lines into a mllib.linalg.distributed.BlockMatrix.

Converting the vectors of the entire DataFrame, or of a DataFrame filtered to a single doc_id, into a BlockMatrix is relatively straightforward: first convert the vectors to an RDD of ((numRows, numCols), DenseMatrix) tuples. Code example below.

However, I am having trouble converting the RDD of Iterator[((numRows, numCols), DenseMatrix)] that mapPartitions returns (after transforming the vectors column of each doc_id partition) into a separate BlockMatrix for each doc_id partition.
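For reference, the block format that BlockMatrix expects is an RDD of ((blockRowIndex, blockColIndex), sub-matrix) tuples; a single 1 x 3 block (dimensions here purely illustrative) looks like:

from pyspark.mllib.linalg import Matrices

# One block of a BlockMatrix: ((blockRowIndex, blockColIndex), sub-matrix).
block = ((0, 0), Matrices.dense(1, 3, [0.1, 0.2, 0.3]))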
My cluster has 3 worker nodes, each with 16 cores and 62 GB of memory.
Imports and start Spark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg import MatrixUDT
from pyspark.mllib.linalg.distributed import BlockMatrix
spark = (
    SparkSession.builder
    .master('yarn')
    .appName("linalg_test")
    .getOrCreate()
)
sc = spark.sparkContext  # used below by RandomRDDs.normalVectorRDD
Create test DataFrame
nRows = 25000
""" Create ids dataframe """
win = (
    W.partitionBy(F.col('doc_id'))
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)
df_ids = (
    spark.range(0, nRows, 1)
    .withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
    .withColumn('doc_id', F.floor(F.col('rand1') / 3).cast(T.IntegerType()))
    .withColumn('int', F.lit(1))
    .withColumn('line_id', F.sum(F.col('int')).over(win))
    .select('id', 'doc_id', 'line_id')
)
""" Create vector dataframe """
df_vecSchema = T.StructType([
    T.StructField('vectors', T.StructType([T.StructField('vectors', VectorUDT())])),
    T.StructField('id', T.LongType())
])

vecDim = 50
df_vec = (
    spark.createDataFrame(
        RandomRDDs.normalVectorRDD(sc, numRows=nRows, numCols=vecDim, seed=54321)
        .map(lambda x: Row(vectors=Vectors.dense(x)))
        .zipWithIndex(),
        schema=df_vecSchema)
    .select('id', 'vectors.*')
)
""" Create final test dataframe """
df_SO = (
    df_ids.join(df_vec, on='id', how='left')
    .select('doc_id', 'line_id', 'vectors')
    .orderBy('doc_id', 'line_id')
)
numDocs = df_SO.agg(F.countDistinct(F.col('doc_id'))).collect()[0][0]
# numDocs = df_SO.groupBy('doc_id').agg(F.count(F.col('line_id'))).count()
df_SO = df_SO.repartition(numDocs, 'doc_id')
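Note that repartition(numDocs, 'doc_id') hash-partitions on doc_id, so every doc_id lands in exactly one partition, but hash collisions can still put more than one doc_id into the same partition. A quick driver-side sanity check (a sketch, fine for this small test data):

# Count the distinct doc_ids that ended up in each partition.
doc_counts = (
    df_SO.rdd
    .mapPartitions(lambda rows: [len({r.doc_id for r in rows})])
    .collect()
)
print(doc_counts)  # ideally all 0s (empty partitions) and 1s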
RDD function to create matrices out of the vectors column
def vec2mat(row):
    # One ((row, col), DenseMatrix) block per line: a 1 x vecDim matrix
    # placed at block-row line_id - 1.
    return (
        (row.line_id - 1, 0),
        Matrices.dense(1, vecDim, row.vectors.toArray().tolist()),
    )
Create dense matrices from the vectors of each line_id
mat = df_SO.rdd.map(vec2mat)
Create a distributed BlockMatrix from the RDD of DenseMatrix
blk_mat = BlockMatrix(mat, 1, vecDim)
Check output
blk_mat
<pyspark.mllib.linalg.distributed.BlockMatrix at 0x7fe1da370a50>
blk_mat.blocks.take(1)
[((273, 0),
DenseMatrix(1, 50, [1.749, -1.4873, -0.3473, 0.716, 2.3916, -1.5997, -1.7035, 0.0105, ..., -0.0579, 0.3074, -1.8178, -0.2628, 0.1979, 0.6046, 0.4566, 0.4063], 0))]
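As an additional check, the global dimensions of the distributed matrix can be read back (they are inferred from the block indices and per-block sizes):

# Global shape inferred from block indices and block sizes.
print(blk_mat.numRows(), blk_mat.numCols())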
I cannot get the same thing to work after converting the vectors column of each doc_id partition with mapPartitions. The mapPartitions function itself works, but I cannot convert the RDD it returns into a BlockMatrix.
RDD function to create dense matrices separately from the vectors of each line_id, per doc_id partition
def vec2mat_p(iter):
    # Yields a single list of ((row, col), DenseMatrix) tuples per partition.
    yield [((row.line_id - 1, 0),
            Matrices.dense(1, vecDim, row.vectors.toArray().tolist()))
           for row in iter]
Create dense matrices separately from the vectors of each line_id, per doc_id partition
mat_doc = df_SO.rdd.mapPartitions(vec2mat_p, preservesPartitioning=True)
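(For comparison: if the generator yields the tuples one at a time instead of one list per partition, the result is a flat RDD of blocks that BlockMatrix does accept, but that just reproduces the single all-documents matrix from above rather than one matrix per doc_id. A sketch:)

def vec2mat_flat(it):
    # Yield one block tuple per row instead of one list per partition.
    for row in it:
        yield ((row.line_id - 1, 0),
               Matrices.dense(1, vecDim, row.vectors.toArray().tolist()))

mat_flat = df_SO.rdd.mapPartitions(vec2mat_flat, preservesPartitioning=True)
blk_flat = BlockMatrix(mat_flat, 1, vecDim)  # one matrix over all docs, not per doc_id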
Check
mat_doc
PythonRDD[4991] at RDD at PythonRDD.scala:48
mat_doc.take(1)
[[((0, 0),
DenseMatrix(1, 50, [1.814, -1.1681, -2.1887, -0.5371, -0.7509, 2.3679, 0.2795, 1.4135, ..., -0.3584, 0.5059, -0.6429, -0.6391, 0.0173, 1.2109, 1.804, -0.9402], 0)),
((1, 0),
DenseMatrix(1, 50, [0.3884, -1.451, -0.0431, -0.4653, -2.4541, 0.2396, 1.8704, 0.8471, ..., -2.5164, 0.1298, -1.2702, -0.1286, 0.9196, -0.7354, -0.1816, -0.4553], 0)),
((2, 0),
DenseMatrix(1, 50, [0.1382, 1.6753, 0.9563, -1.5251, 0.1753, 0.9822, 0.5952, -1.3924, ..., 0.9636, -1.7299, 0.2138, -2.5694, 0.1701, 0.2554, -1.4879, -1.6504], 0)),
...]]
Check types
(mat_doc
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [(type(m[0]), (type(m[0][0]), type(m[0][1])), type(m[1])) for m in mlst])
    .first()
)
[(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
...]
This seems correct, but running:
(mat_doc
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [BlockMatrix((m[0], m[1])[0], 1, vecDim) for m in mlst])
    .first()
)
results in the following TypeError:
TypeError: blocks should be an RDD of sub-matrix blocks as ((int, int), matrix) tuples, got
Unfortunately, the error message cuts off short and does not tell me what it actually 'got'. Also, I cannot call sc.parallelize() from inside a map() call.
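(That is, a pattern like the following fails, because the SparkContext is only available on the driver:)

# Fails: SparkContext cannot be used inside executor-side code such as map().
# bad = mat_doc.map(lambda blocks: BlockMatrix(sc.parallelize(blocks), 1, vecDim))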
How do I convert each item of the RDD that mapPartitions returns into an RDD that BlockMatrix will accept?
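For what it's worth, the only workaround I have come up with is a driver-side loop over the distinct doc_ids, sketched below (assuming a plain Python dict of per-document BlockMatrices is an acceptable result). It works, but it launches one filter per document, so I am hoping there is a proper distributed way to do this:

# Driver-side workaround (sketch): one BlockMatrix per doc_id via repeated filters.
doc_ids = [r.doc_id for r in df_SO.select('doc_id').distinct().collect()]

doc_matrices = {}
for d in doc_ids:
    blocks = df_SO.filter(F.col('doc_id') == d).rdd.map(vec2mat)
    doc_matrices[d] = BlockMatrix(blocks, 1, vecDim)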