Irregular execution of PCA with PySpark

Asked: 2017-03-14 15:25:49

Tags: csv pyspark pca

I am running PCA on a csv file with PySpark. I am seeing some strange behaviour: sometimes my code works perfectly, but sometimes it returns this error:

 File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_final2.py", line 25, in <module>
columns = (fileObj.first()).split(';')
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1361, in first
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1343, in take
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 965, in runJob
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
File "C:\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.net.SocketException: Connection reset by peer: socket write error

Here is my code:

#########################! importing libraries !########################
from __future__ import print_function
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.ml.feature import PCA, VectorAssembler
from pyspark.mllib.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.mllib.feature import Normalizer
import timeit
########################! main script !#################################
sc = SparkContext("local", "pca-app")
sqlContext = SQLContext(sc)
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()  
    start=timeit.default_timer() 
    fileObj = sc.textFile('bigiris.csv')
    data = fileObj.map(lambda line: [float(k) for k in line.split(';')])
    columns = (fileObj.first()).split(';')
    df = spark.createDataFrame(data, columns)
    df.show()
    vecAssembler = VectorAssembler(inputCols=columns, outputCol="features")
    pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
    pipeline = Pipeline(stages=[vecAssembler, pca])
    model = pipeline.fit(df)
    result = model.transform(df).select("pcaFeatures")
    stop=timeit.default_timer()
    result.show(truncate=False)
    time=stop-start
    print ("this operation takes ", (time), " seconds")
    spark.stop()

Why do I get this erratic execution? What should I add to fix the problem?

2 answers:

Answer 0 (score: 3)

You are not filtering out the header when you create the data frame. Assuming your column names are strings, this causes the error, because the column names cannot be converted to float values. See the modified part of the script below, which uses filter to remove the header.

fileObj = sc.textFile('e:/iris.data.txt')
header = fileObj.first()
data = fileObj.filter(lambda x: x != header).map(lambda line: [float(k) for k in line.split(';')])
columns = header.split(';')
df = spark.createDataFrame(data, columns)
df.show()
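
For completeness, a minimal sketch of how that fix slots into the rest of the script from the question (assuming the same semicolon-separated bigiris.csv with a header row of column names, and the imports and SparkSession already created in the question):

# drop the header row before converting the remaining values to float
fileObj = sc.textFile('bigiris.csv')
header = fileObj.first()
columns = header.split(';')
data = fileObj.filter(lambda x: x != header).map(lambda line: [float(k) for k in line.split(';')])
df = spark.createDataFrame(data, columns)

# assemble all columns into one feature vector and run PCA, as in the question
vecAssembler = VectorAssembler(inputCols=columns, outputCol="features")
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
pipeline = Pipeline(stages=[vecAssembler, pca])
result = pipeline.fit(df).transform(df).select("pcaFeatures")
result.show(truncate=False)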

Answer 1 (score: 1)

The error here points to the line columns = (fileObj.first()).split(';'). Basically, you are trying to split the first row of fileObj on (;) to get the columns; the columns line should come before the data line. The order of operations as written is wrong, because in the previous step the rows have already been converted to lists.

The correct order of operations is this:

fileObj = sc.textFile('bigiris.csv')
columns = (fileObj.first()).split(';')
data = fileObj.map(lambda line: [float(k) for k in line.split(';')])
df = spark.createDataFrame(data, columns)

Reason for the error: the data = line contains fileObj.map with line.split(';'), which already splits every csv row on (;).

If you have the header as text in the csv and want to remove it from the data, follow Jaco's answer and filter it out with filter(lambda x: x != header).
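
As a side note, and not part of either answer above: since Spark 2.x the DataFrame CSV reader can handle the header row and type conversion itself, which avoids the manual split and filter entirely. A rough sketch, assuming a semicolon-delimited file whose header names purely numeric columns:

# let the CSV reader parse the header and infer numeric column types
df = spark.read.csv('bigiris.csv', sep=';', header=True, inferSchema=True)
vecAssembler = VectorAssembler(inputCols=df.columns, outputCol="features")
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
result = Pipeline(stages=[vecAssembler, pca]).fit(df).transform(df).select("pcaFeatures")
result.show(truncate=False)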