I am trying to load this CSV file, which has many columns, and compute the correlation between the columns with Spark.
from pyspark import SparkContext, SparkConf
from pyspark.mllib.stat import Statistics
conf = SparkConf()\
    .setAppName("Movie recommender")\
    .setMaster("local[*]")\
    .set("spark.driver.memory", "10g")\
    .set("spark.driver.maxResultSize", "4g")
sc = SparkContext(conf=conf)
pivot = sc.textFile(r"pivot.csv")
header = pivot.first()
pivot = pivot.filter(lambda x:x != header)
pivot = pivot.map(lambda x:x.split()).cache()
corrs = Statistics.corr(pivot)
I get this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
Answer 0 (score: 0)
I managed to get it running by increasing the number of partitions. But on my local machine the performance was so slow that in practice it did not work. The performance problem seems to appear when the number of columns is high.
from pyspark.mllib.linalg import Vectors

def extract_sparse(str_lst, N):
    """Turn a list of string fields into (size, {index: value}), skipping missing entries."""
    if len(str_lst) == 0:
        return (0, {})
    else:
        keyvalue = {}
        length = len(str_lst)
        if length > N:
            length = N  # cap the vector size at N columns
        for i in range(length):
            if str_lst[i] != '':  # not missing
                keyvalue[i] = float(str_lst[i])
        return (length, keyvalue)
pivot = sc.textFile(r"pivot.csv", 24)
header = pivot.first()
pivot = pivot.filter(lambda x:x != header)
pivot = pivot.map(lambda x:x.split(','))
pivot = pivot.map(lambda x: extract_sparse(x, 50000))
pivot = pivot.map(lambda x: Vectors.sparse(x[0], x[1]))
pivot = pivot.map(lambda x: x.toArray()).collect()
corrs = Statistics.corr(pivot)
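
For comparison, the same correlation matrix can also be computed with the DataFrame-based API, which avoids the hand-rolled sparse parsing. This is only a sketch, assuming Spark 2.4+ (for VectorAssembler's handleInvalid option) and that pivot.csv has a header row whose columns are inferred as numeric:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder \
    .appName("Movie recommender") \
    .master("local[*]") \
    .config("spark.driver.memory", "10g") \
    .getOrCreate()

# Read the CSV with a header and let Spark infer numeric types.
df = spark.read.csv("pivot.csv", header=True, inferSchema=True)

# Pack all columns into a single vector column; rows with missing values are skipped.
assembler = VectorAssembler(inputCols=df.columns, outputCol="features", handleInvalid="skip")
features = assembler.transform(df).select("features")

# Correlation.corr returns a one-row DataFrame holding the correlation matrix.
corr_matrix = Correlation.corr(features, "features").head()[0]
print(corr_matrix.toArray())

Note that handleInvalid="skip" drops rows containing nulls, so the result can differ from the sparse-vector version above, which treats missing entries as zeros.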