pyspark java.io.IOException: fail to rename file

Time: 2019-02-22 01:53:19

Tags: pyspark

I keep running into this error when the input file is large. Any help would be appreciated.

Caused by: java.io.IOException: fail to rename file C:\spark\storage_space\blockmgr-34e6f308-0735-45bd-9cfc-5087accfe917\0c\shuffle_14_1_0.data.da203675-cf88-4977-81d8-1906072fe655 to C:\spark\storage_space\blockmgr-34e6f308-0735-45bd-9cfc-5087accfe917\0c\shuffle_14_1_0.data
        at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:183)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:164)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more

19/02/21 17:52:01 INFO SparkContext: Invoking stop() from shutdown hook
19/02/21 17:52:01 INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
19/02/21 17:52:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/21 17:52:02 INFO MemoryStore: MemoryStore cleared
19/02/21 17:52:02 INFO BlockManager: BlockManager stopped
19/02/21 17:52:02 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/21 17:52:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/21 17:52:02 INFO SparkContext: Successfully stopped SparkContext
19/02/21 17:52:02 INFO ShutdownHookManager: Shutdown hook called
19/02/21 17:52:02 INFO ShutdownHookManager: Deleting directory C:\spark\storage_space\spark-af366c79-78e7-4b55-9513-d411cd1d418c
19/02/21 17:52:02 INFO ShutdownHookManager: Deleting directory C:\spark\storage_space\spark-af366c79-78e7-4b55-9513-d411cd1d418c\pyspark-e4a37af5-7a19-4000-b921-7351285766d2
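For what it's worth, the blockmgr-... path in the trace sits under the spark.local.dir that the script below sets. A quick way to double-check which scratch directory the running context has actually picked up is a one-liner like this (just a sketch; SparkContext.getConf() is PySpark's standard config accessor):

# Sketch: print the local scratch directory the running SparkContext is using.
print(sc.getConf().get('spark.local.dir'))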

Here is my code:

import re
import sys
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import numpy as np
from copy import deepcopy

conf = SparkConf()
conf.set('spark.local.dir', 'C:/spark/storage_space')
sc = SparkContext(conf=conf)
lines = sc.textFile(sys.argv[1])
MAX_ITER = 40
beta = 0.8
nodes = 100 #to change


edges = lines.map(lambda x: (int(re.split(r'\s+', x)[0]), int(re.split(r'\s+', x)[1])))
unique_edges = edges.distinct() #there might be repeated ones
key_edges = unique_edges.map(lambda x: (int(x[0]),1))
out_edges = key_edges.reduceByKey(lambda accum, n: accum + n) #key is the node number, value is the number of outgoing edges
out_edges_lst = unique_edges.collect()
output_dict = dict((int(key),value) for key,value in out_edges.collect())
m_init = unique_edges.map(lambda x: ((x[0], x[1]),1/output_dict.get(x[0])))
m_ji = m_init.map(lambda x: ((x[0][1], x[0][0]), x[1])) #((destin, source), 1/deg_source) #row,col

def hits():
    lambdax = 1
    miu = 1
    L = unique_edges #row, col
    #LT = L.map(lambda x: (x[1], x[0])) #col, row
    new_h = [1 for i in range(nodes)]
    for i in range(MAX_ITER):
        ###PROCESSING A FIRST ###
        LT_h = L.map(lambda x: (x[1], new_h[x[0]-1]*miu)) #h[col-1] #(row, value)
        LT_h_add = LT_h.reduceByKey(lambda x, y: x+y)
        LT_max = LT_h_add.max(lambda x: x[1])[1] #take the value of the max (key, value) pair
        new_a = LT_h_add.map(lambda x: (x[0]-1,float(x[1])/LT_max)).collectAsMap()
        print("this is iteration round" + str(i))
        print("printing new a")
        print(new_a)
        #a is stored as a dictionary
        ##PROCESSING h###
        L_a = L.map(lambda x: (x[0], new_a[x[1]-1]*lambdax))
        L_a_add = L_a.reduceByKey(lambda x, y: x+y)
        L_a_add_max = L_a_add.max(lambda x: x[1])[1]
        new_h = L_a_add.map(lambda x: (x[0]-1, float(x[1])/L_a_add_max)).collectAsMap()
        print(new_h)
    final_a = sorted(new_a.items(), key = lambda x: x[1], reverse=True)[:5] #largest authority scores first
    final_h = sorted(new_h.items(), key = lambda x: x[1], reverse=True)[:5] #largest hub scores first
    print("printing top 5 for H")
    print(final_h)
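The snippet ends with the definition of hits(); the call site is not shown above. A minimal sketch of how it is driven, assuming the script is saved as hits.py and the edge-list file is passed as the first command-line argument (both names are placeholders, not part of the code above):

# Illustrative only: invoke the HITS loop and shut the context down cleanly.
hits()
sc.stop()

# launched roughly as:
#   spark-submit hits.py edges.txt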

0 Answers:

No answers