I am trying to work with DataFrames in Zeppelin using the pyspark interpreter. I ran the following commands.
First, I created a DataFrame from the database table TABLE:
%pyspark
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer
import pyspark.mllib
import pyspark.mllib.regression
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import *
sqlContext = sqlc
df = sqlContext.sql("SELECT * FROM TABLE")
df.show(1)
This works fine, and the output is the following:
+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+
|id_avi_prs_defecto_causa|s_avi_defectos|s_avi_causas|s_avi_acciones|s_avi_grupos_defecto|s_avi_prs|orden_defectos|s_mto_aparatos| f_realizacion|s_avi_avisos|importe_equiv|s_cli_clientes|s_gen_recursos__conservador|subcontratado_sn|material_sn|s_art_articulos|tarifa|precio|cantidad|horas_invertidas|s_avi_urgencia|repetitivo_sn|descripcion_avisos|s_avi_incidencias_avisos|h24_sn|coincidente_sn|parametro_3|parametro_4|parametro_5|parametro_6|parametro_7|parametro_8|parametro_9|parametro_10|parametro_11|parametro_12|parametro_13|ppss|modelo_maniobra|accionamiento|cargas|velocidades_asc|paradas|n_viviendas|num_viviendas_planta|protocolo_placa| f_puesta_en_marcha|modelo_esc|inclinacion|desnivel|longitud|ancho_peldano|velocidades|perfil|altura_balaustrada|funcionamiento|proteccion_intemperie| antiguedad_anios|id_antiguedad_tramo|tipo_enlace|propiedad_enlace|codigo_postal| fecha_proxima_ipo| fecha_ultima_ipo|problematico_sn|tasa_avisos_anual|f_ita|ult_est_ipo| f_ult_accion|intervalo_ipo|n_revisiones_teoricas|activo|cuarto_maquinas|embarques|tipo_cable|puertas|ft_luz|
+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+
| 83107.0| 184.0| 251.0| 175.0| 15.0| 234.0| 1.0| 347.0|2010-07-06 00:00:...| 234.0| null| 1151.0| 8691.0| N| N| null| null| null| null| null| N| N | NO ABRE PUERTAS| null| N| N| X030| ARCA| H| 04| 063| 04| B_1| ORO-2M5| 06| null| null|X030| ARCA| H| 04| 063| 04| null| null| ORONA 2005|2006-03-27 00:00:...| null| null| null| null| null| 063| null| null| null| null|10.559379853643966| 3.0| OMU| ORO| 20820|2018-03-15 00:00:...|2012-03-15 00:00:...| N| 0.0| null| <?>|2015-03-01 00:00:...| 6.0| 9.0| S| <?>|ACC_2_180| CONV| TT| null|
+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+
Next, I tried to apply a StringIndexer like this:
%pyspark
indexer = StringIndexer(inputCol="parametro_3", outputCol="parametro_3indexed")
df = indexer.fit(df).transform(df)
df.show()
But I got the following error:
Py4JJavaError: An error occurred while calling o4450.showString.
: java.lang.NullPointerException
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o4450.showString.\n', JavaObject id=o4451), <traceback object at 0x20c0d40>)
My question is: why am I getting this error? I tried the following and it works:
df = spark.createDataFrame(
[(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
["id", "category"])
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
What am I doing wrong? Or is there something wrong with my DataFrame?