I'm trying to learn Spark, so I installed version 2.4 and I'm working with the Santander dataset available on Kaggle.
I imported the libraries, loaded the training and test files, and removed the headers. Everything is now in RDDs.
I noticed that some columns hold numeric values stored as strings. My plan is: convert each line into a Row object, then create a Spark DataFrame, build a dense vector for each row, and finally apply machine learning.
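Roughly, the shape I'm aiming for per line is something like this (a sketch with made-up column names and values, just to illustrate the structure):

from pyspark.sql import Row

# sketch only: each CSV line becomes a Row of floats; the Rows then become a
# DataFrame that feeds the dense-vector / PCA / RandomForestClassifier steps
exemplo = Row(var3=2.0, var15=23.0)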
# import the libraries
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
from pyspark.sql.functions import col, sum
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import PCA
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# create spark session
spSession = SparkSession.builder.master("local").appName("DSA-MiniProjeto4").getOrCreate()
sc = spSession.sparkContext  # the textFile() calls below need the SparkContext
# reading files
arquivo_treino = sc.textFile("train.csv")
arquivo_teste = sc.textFile("test.csv")
arquivo_treino.cache()
arquivo_teste.cache()
# create lists with the column names:
nome_colunas_treino = arquivo_treino.first().split(',')
nome_colunas_teste = arquivo_teste.first().split(',')
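For context, each header list has 371 names and starts with the ID column. Peeking at it gives something like this (first names from memory, so treat them as approximate):

print(nome_colunas_teste[:3])
# roughly ['ID', 'var3', 'var15'] -- 'var3' is the name that appears in the error below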
# create RDD without header:
cabecalho_treino = arquivo_treino.first()
cabecalho_teste = arquivo_teste.first()
treinoRDD = arquivo_treino.filter(lambda x: x != cabecalho_treino)
testeRDD = arquivo_teste.filter(lambda x: x != cabecalho_teste)
The problem is that this dataset has 371 columns, so I wrote a function with a while loop inside to walk over every column. That's where the error appears: my first attempt produced the message below.
# function to convert columns to float:
def transformaParaFlutuante(linha):
    tamanho = len(nome_colunas_teste)
    i = 1  # skip the ID column
    while i <= tamanho:
        nome_col = nome_colunas_teste[i]
        val_coln = float(nome_colunas_teste[i])
        linha = Row(nome_col = val_coln)
        i = i + 1
    return linha
# applying the function and creating new RDDs
treinoRDD_v2 = treinoRDD.map(transformaParaFlutuante)
treinoRDD_v2.persist()
testeRDD_v2 = testeRDD.map(transformaParaFlutuante)
testeRDD_v2.persist()
# no error is reported up to this point: map() is lazy, so the function
# only actually runs when an action forces evaluation
# executing the line below triggers the error
treinoRDD_v2.take(3)
Error:
Py4JJavaError                             Traceback (most recent call last)
      1 # ver resultado treino
----> 2 treinoRDD_v2.take(3)

/opt/spark/python/pyspark/rdd.py in take(self, num)
   1358
   1359                 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1360                 res = self.context.runJob(self, takeUpToNumLeft, p)
   1361
   1362             items += res

/opt/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   1067         # SparkContext#runJob.
   1068         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1069         sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1070         return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
   1071

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/opt/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 14, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "", line 8, in transformaParaFlutuante
ValueError: could not convert string to float: 'var3'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:274)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "", line 8, in transformaParaFlutuante
ValueError: could not convert string to float: 'var3'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
The result I need is something like this (the column names in this example come from a different dataset; it's the format that matters):
Row(ACCELERATION=12.0, CYLINDERS=8.0, DISPLACEMENT=307.0, ...)
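My suspicion is that the function should split each line and cast the values, not the header names, building the Row dynamically from the column names. Something like the untested sketch below; the dictionary unpacking and the loop bounds are my guesses:

# untested sketch: split the CSV line, cast everything after the ID column
# to float, and use the real column names as the Row's field names
def transformaParaFlutuante_v2(linha):
    valores = linha.split(',')
    d = {nome_colunas_treino[i]: float(valores[i])
         for i in range(1, len(nome_colunas_treino))}
    return Row(**d)

I haven't been able to verify this, and I'm not sure Row(**d) preserves the column order.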
Any guidance on how to fix this?
Thanks!