I have been using the numpy/scipy eigendecomposition routines in Python to compute the Fiedler eigenvector of matrices up to 10K x 10K. I would like to scale up to much larger matrices (100K or more) and run the computation as fast as possible. At 10K the eigendecomposition already takes me several minutes. Here is the code I am currently using:
from numpy import argsort
from scipy.linalg import eigh

# Full dense eigendecomposition of the symmetric Laplacian.
w, v = eigh(lapMat)
# The Fiedler vector is the eigenvector of the second-smallest eigenvalue.
sortWInds = argsort(w)
fVec = v[:, sortWInds[1]]
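For comparison, the dense eigh call above computes all 10K eigenpairs even though only one is needed. A minimal sketch of restricting the computation to the smallest few eigenpairs with scipy.sparse.linalg.eigsh, assuming lapMat is (or can be converted to) a sparse symmetric Laplacian:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

lapSparse = csr_matrix(lapMat)  # assumes the Laplacian is actually sparse
# Ask ARPACK for only the 2 smallest eigenpairs instead of all of them.
# which='SM' can converge slowly; shift-invert (sigma slightly below 0)
# is usually faster but requires factorizing lapSparse - sigma*I.
w, v = eigsh(lapSparse, k=2, which='SM')
fVec = v[:, np.argsort(w)[1]]  # second-smallest eigenvector = Fiedler vector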
From what I understand of Spark, an eigendecomposition still requires a lot of cross-talk between cores in a distributed system. I had a contractor run some tests, but did not see the speedup I had hoped for from Spark on a multi-core AWS AMI. Here is the main code used to perform the SVD on an AWS Ubuntu Linux AMI:
# Benchmarking setup - configure the benchmark sizes and core counts here
sizes = [10000]
cores = [64]
max_cores_for_a_single_executor = 8

from datetime import datetime
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix

# Iterate over matrix sizes
for size in sizes:
    # Iterate over the number of cores used
    for core in cores:
        # Derive the Spark configuration for the distributed setup
        executor_cores = min(core, max_cores_for_a_single_executor)
        executors = max(1, core // max_cores_for_a_single_executor)

        # Initialize Spark
        conf = SparkConf().setAppName("SVDBenchmarking") \
            .set("spark.executor.cores", str(executor_cores)) \
            .set("spark.executor.instances", str(executors)) \
            .set("spark.dynamicAllocation.enabled", "false") \
            .set("spark.driver.maxResultSize", "25g") \
            .set("spark.executor.memory", "60g")
        sc = SparkContext.getOrCreate(conf=conf)

        start = datetime.now()

        # Input matrix of this size, generated and saved earlier;
        # textToVector and extract are helpers defined elsewhere
        inputRdd = sc.textFile("hdfs://ip-172-31-34-253.us-west-2.compute.internal:8020/data/input" + str(size))
        # inputRdd = sc.textFile("/Users/data/input" + str(size))  # local-run variant
        intermid2 = inputRdd \
            .map(lambda x: textToVector(x)) \
            .sortByKey() \
            .map(lambda x: extract(x))
        mat = RowMatrix(intermid2)

        # Step 2: run the SVD
        svd = mat.computeSVD(size, computeU=True)
        U = svd.U  # The U factor is a RowMatrix.
        s = svd.s  # The singular values are stored in a local dense vector.
        V = svd.V  # The V factor is a local dense matrix.

        # Stop the clock for the benchmark
        end = datetime.now()
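One thing worth flagging in the benchmark above: computeSVD(size, computeU=True) requests all `size` singular triplets, i.e. the full SVD, which is by far the most expensive setting. A hedged sketch of the truncated call (the value k=20 is purely illustrative, not from my benchmark):

# Truncated SVD: only the top-k singular triplets, far cheaper than k=size.
# Note this returns the DOMINANT singular vectors; reaching a non-dominant
# vector like the Fiedler vector needs the spectral shift sketched below.
k = 20  # illustrative value, not from the benchmark
svdTopK = mat.computeSVD(k, computeU=False)
sTopK = svdTopK.s  # top-k singular values (local dense vector)
VTopK = svdTopK.V  # n x k local dense matrix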
Given that the eigenstructure of a matrix is central to recommendation algorithms, there must be "community" work on computing SVDs substantially faster than the single-core approach in numpy/scipy.
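One reasoning step that seems worth making explicit: fast top-k routines return the dominant singular pairs, but for a symmetric PSD Laplacian a spectral shift maps the smallest eigenvalues to the largest eigenvalues of a shifted matrix, so a top-k solver can still recover the Fiedler vector. A minimal numpy sketch of the identity (the toy path-graph Laplacian is just for illustration):

import numpy as np

# Toy symmetric graph Laplacian (path graph on 4 nodes).
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

# By Gershgorin, the eigenvalues of L lie in [0, c] for c = 2 * max degree,
# so the smallest eigenvalues of L become the LARGEST of c*I - L,
# which is exactly what top-k solvers return.
c = 2 * L.diagonal().max()
w_shift, v_shift = np.linalg.eigh(c * np.eye(4) - L)  # ascending order

# Largest shifted eigenvector corresponds to L's eigenvalue 0 (constant vector);
# the second largest is the Fiedler vector of L.
fiedler = v_shift[:, -2]
print(fiedler)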
More recently there has also been work on multigrid algorithms for explicitly computing the Fiedler eigenvector (Urschel 2014). I believe he made MATLAB code available at one point.
Can anyone point me to 1) the state of the art for quickly computing non-dominant eigenvectors/singular vectors (such as the Fiedler eigenvector) of large matrices, 2) public code bases that perform these computations, and 3) recommended architectures for running these computations on matrices of size 10K or (much) larger without exhausting RAM?
(Humbly) thanks,
Nirmal