I'm trying to preprocess data for a prediction model using PySpark. When I call spark.createDataFrame during preprocessing, I get an error. Is there a way to inspect what the processed RDD looks like before converting it to a DataFrame?
import findspark
findspark.init('/usr/local/spark')
import pyspark
from pyspark.sql import SQLContext
import os
import pandas as pd
import geohash2
sc = pyspark.SparkContext('local', 'sentinel')
spark = pyspark.SQLContext(sc)  # used below to build a DataFrame from the processed RDD
sql = SQLContext(sc)  # a second SQLContext wrapper over the same SparkContext
working_dir = os.getcwd()
df = sql.createDataFrame(data)
df = df.select(['starttime', 'latstart', 'lonstart', 'latfinish', 'lonfinish', 'trip_type'])
df.show(10, False)
processedRDD = df.rdd
processedRDD = processedRDD \
    .map(lambda row: (row, g, b, minutes_per_bin)) \
    .map(data_cleaner) \
    .filter(lambda row: row is not None)
print(processedRDD)  # only prints the RDD object, not its contents
featuredDf = spark.createDataFrame(processedRDD, ['year', 'month', 'day', 'time_cat', 'time_num', 'time_cos',
                                                  'time_sin', 'day_cat', 'day_num', 'day_cos', 'day_sin', 'weekend',
                                                  'x_start', 'y_start', 'z_start', 'location_start', 'location_end', 'trip_type'])
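On the inspection question itself: print(processedRDD) only shows the RDD object, not its rows. A minimal sketch of peeking at the cleaned rows before calling createDataFrame (assuming the pipeline above has been built) would be:

for row in processedRDD.take(5):  # pull a few cleaned rows back to the driver
    print(row)
print(processedRDD.count())  # forces full evaluation, so errors from data_cleaner surface here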
I get this error:
[Stage 1:> (0 + 1) / 1]2019-10-24 15:37:56 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 (TID 1)
raise AppRegistryNotReady("Apps aren't loaded yet.") django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I don't understand what this has to do with importing Django apps.
Answer 0 (score: 0)
Basically, you need to run the following first, so that the settings are loaded and Django's app registry is populated. Everything you need is in the Django documentation.
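The code block this answer refers to does not appear here; a minimal sketch, consistent with the snippet in Answer 1 below (the settings module name is a placeholder), would be:

import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # placeholder settings module
import django
django.setup()  # loads settings and populates Django's app registry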
Answer 1 (score: 0)
I don't know what exactly this script has to do with Django, but adding the following lines at the top of the script may solve the problem:
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
import django
django.setup()
Answer 2 (score: 0)
I am not running Hadoop by hand; I built a Python server that uses pyspark and runs heavy AI workloads (roughly 10x the load) alongside a Django server. My problem came from SPARK_LOCAL_IP, which was set to a different IP (the one I use to connect to a remote database through an SSH tunnel). I import and use pyspark. I had to rename the file and add the correct IP:
cd /usr/local/spark/conf
touch spark-env.sh.template
mv -i spark-env.sh.template spark-env.sh
nano spark-env.sh
paste: SPARK_LOCAL_IP="127.0.1.1"
Then I had to add sc.setLogLevel("ERROR") to my views.py in order to see the real problem. Debugging Java errors from Python can sometimes be painful. In my case a column was a datetime instead of a string, and fixing that solved it.
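A rough sketch of those two steps, assuming the same sc and processedRDD as in the question (which field was the datetime is not shown, so the conversion below simply stringifies any datetime values it finds):

sc.setLogLevel("ERROR")  # show only real errors instead of the INFO/WARN noise

import datetime
# convert any datetime values in each row to strings before createDataFrame
processedRDD = processedRDD.map(
    lambda row: tuple(str(v) if isinstance(v, datetime.datetime) else v for v in row))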