How to map an RDD function over each RDD in the iterator returned by mapPartitions

Date: 2019-06-08 18:56:18

Tags: apache-spark pyspark rdd apache-spark-mllib

I have a DataFrame with a document ID doc_id, a line ID line_id for each line within a document, and a dense vector representation vectors of each line. For each document (doc_id), I want to convert the set of vectors representing its lines into an mllib.linalg.distributed.BlockMatrix.

Converting the vectors to an RDD of ((numRows, numCols), DenseMatrix) tuples and then turning the vectors of the entire DataFrame, or of a DataFrame filtered to a single doc_id, into a BlockMatrix is relatively straightforward. Code example below.

However, I am having trouble converting the RDD of Iterator[((numRows, numCols), DenseMatrix)] that mapPartitions returns over the vectors column of each doc_id partition into a separate BlockMatrix for each doc_id partition.

My cluster has 3 worker nodes, each with 16 cores and 62 GB of memory.


Imports and starting Spark

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import Window as W
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg import MatrixUDT
from pyspark.mllib.linalg.distributed import BlockMatrix

spark = (
    SparkSession.builder
    .master('yarn')
    .appName("linalg_test")
    .getOrCreate()
)

sc = spark.sparkContext  # used below by RandomRDDs.normalVectorRDD

Create the test DataFrame

nRows = 25000

""" Create ids dataframe """
win = (W
    .partitionBy(F.col('doc_id'))    
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df_ids = (
    spark.range(0, nRows, 1)
    .withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
    .withColumn('doc_id', F.floor(F.col('rand1')/3).cast(T.IntegerType()) )
    .withColumn('int', F.lit(1))
    .withColumn('line_id', F.sum(F.col('int')).over(win))
    .select('id', 'doc_id', 'line_id')
)

""" Create vector dataframe """
df_vecSchema = T.StructType([
    T.StructField('vectors', T.StructType([T.StructField('vectors', VectorUDT())] ) ), 
    T.StructField('id', T.LongType()) 
])

vecDim = 50
df_vec = (
    spark.createDataFrame(
        RandomRDDs.normalVectorRDD(sc, numRows=nRows, numCols=vecDim, seed=54321)
        .map(lambda x: Row(vectors=Vectors.dense(x),))
        .zipWithIndex(), schema=df_vecSchema)
    .select('id', 'vectors.*')
)

""" Create final test dataframe """
df_SO = (
    df_ids.join(df_vec, on='id', how='left')
    .select('doc_id', 'line_id', 'vectors')
    .orderBy('doc_id', 'line_id')
)

numDocs = df_SO.agg(F.countDistinct(F.col('doc_id'))).collect()[0][0]
# numDocs = df_SO.groupBy('doc_id').agg(F.count(F.col('line_id'))).count()

df_SO = df_SO.repartition(numDocs, 'doc_id')
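
The per-document steps below assume that the repartitioning groups each doc_id into its own partition. A quick check along these lines (my addition, not from the original post) shows which doc_id values actually land in each partition; note that hash partitioning can still place more than one doc_id in the same partition.

# Optional check (my addition): list the distinct doc_id values per partition
(df_SO.rdd
    .mapPartitions(lambda rows: [set(r.doc_id for r in rows)])
    .filter(lambda s: len(s) > 0)
    .take(5))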

RDD function to create matrices from the vector column

def vec2mat(row):
    # One 1 x vecDim block per line, keyed by its (block row, block column) index
    return (
        (row.line_id-1, 0),
        Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )

Create a dense matrix from each line_id's vector

mat = df_SO.rdd.map(vec2mat)

Create a distributed BlockMatrix from the RDD of DenseMatrix blocks

blk_mat = BlockMatrix(mat, 1, vecDim)

Check the output

blk_mat
<pyspark.mllib.linalg.distributed.BlockMatrix at 0x7fe1da370a50>
blk_mat.blocks.take(1)
[((273, 0),
  DenseMatrix(1, 50, [1.749, -1.4873, -0.3473, 0.716, 2.3916, -1.5997, -1.7035, 0.0105, ..., -0.0579, 0.3074, -1.8178, -0.2628, 0.1979, 0.6046, 0.4566, 0.4063], 0))]
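
As an extra sanity check (my addition, assuming the construction above succeeded), the dimensions reported by the BlockMatrix can be compared against the source DataFrame: with 1 x vecDim blocks there should be nRows rows and vecDim columns overall.

# Sanity check (not in the original post): overall matrix dimensions
print(blk_mat.numRows(), blk_mat.numCols())   # expected: 25000 50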

Problem

I cannot get the same thing to work after applying mapPartitions to each doc_id partition. The mapPartitions function itself works, but I cannot convert the RDD it returns into a BlockMatrix.

RDD function to create dense matrices from each line_id's vector, separately within each doc_id partition

def vec2mat_p(iter):
    # Yield a single list of block tuples for the whole partition (one doc_id)
    yield [((row.line_id-1, 0),
            Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )
        for row in iter]
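
For comparison, a flattened variant of this generator (my sketch, not part of the original question) would yield the block tuples one at a time instead of one list per partition. It loses the per-document grouping this question is after, but the elements of the resulting RDD already have the ((int, int), matrix) shape that BlockMatrix accepts.

# Sketch only: yield blocks individually so each RDD element is a
# ((blockRowIndex, blockColIndex), DenseMatrix) tuple rather than a list
def vec2mat_flat(rows):
    for row in rows:
        yield ((row.line_id - 1, 0),
               Matrices.dense(1, vecDim, row.vectors.toArray().tolist()))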

Create dense matrices from each line_id's vector, separately within each doc_id partition

mat_doc = df_SO.rdd.mapPartitions(vec2mat_p, preservesPartitioning=True)

Check

mat_doc 
PythonRDD[4991] at RDD at PythonRDD.scala:48
mat_doc.take(1)
[[((0, 0),
   DenseMatrix(1, 50, [1.814, -1.1681, -2.1887, -0.5371, -0.7509, 2.3679, 0.2795, 1.4135, ..., -0.3584, 0.5059, -0.6429, -0.6391, 0.0173, 1.2109, 1.804, -0.9402], 0)),
  ((1, 0),
   DenseMatrix(1, 50, [0.3884, -1.451, -0.0431, -0.4653, -2.4541, 0.2396, 1.8704, 0.8471, ..., -2.5164, 0.1298, -1.2702, -0.1286, 0.9196, -0.7354, -0.1816, -0.4553], 0)),
  ((2, 0),
   DenseMatrix(1, 50, [0.1382, 1.6753, 0.9563, -1.5251, 0.1753, 0.9822, 0.5952, -1.3924, ..., 0.9636, -1.7299, 0.2138, -2.5694, 0.1701, 0.2554, -1.4879, -1.6504], 0)),
  ...]]

Check the types

(mat_doc 
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [(type(m[0]), (type(m[0][0]),type(m[0][1])), type(m[1])) for m in mlst] )
    .first()
)
[(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 (tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 (tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 ...]

This seems correct, but running:

(mat_doc 
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [BlockMatrix((m[0], m[1])[0], 1, vecDim) for m in mlst] )
    .first()
)

results in the following TypeError:

TypeError: blocks should be an RDD of sub-matrix blocks as ((int, int), matrix) tuples, got 

Unfortunately, the error message stops short and doesn't tell me what it actually 'got'.

Additionally, I cannot call sc.parallelize() inside a map() call.
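
For reference, the kind of input the constructor does accept is an RDD of ((int, int), matrix) tuples built on the driver, e.g. (a minimal illustration, not a solution to the per-document case):

# Illustration only: BlockMatrix expects a driver-side RDD of
# ((blockRowIndex, blockColIndex), sub-matrix) tuples
blocks = sc.parallelize([((0, 0), Matrices.dense(1, vecDim, [0.0] * vecDim))])
bm = BlockMatrix(blocks, 1, vecDim)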

How do I convert each item in the RDD iterator returned by mapPartitions into an RDD that BlockMatrix will accept?

0 Answers:

There are no answers to this question yet.