NullPointerException when I try to use a DataFrame in pyspark

Asked: 2016-10-18 13:06:40

Tags: python apache-spark dataframe null pyspark

I'm trying to work with a DataFrame in Zeppelin using the pyspark interpreter. I ran the following commands.

First, I created a DataFrame that reads the database table TABLE:

%pyspark

from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer
import pyspark.mllib
import pyspark.mllib.regression
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import *


sqlContext = sqlc
df = sqlContext.sql("SELECT * FROM TABLE")

df.show(1)

This works fine, and the result I get is the following:

+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+
|id_avi_prs_defecto_causa|s_avi_defectos|s_avi_causas|s_avi_acciones|s_avi_grupos_defecto|s_avi_prs|orden_defectos|s_mto_aparatos|       f_realizacion|s_avi_avisos|importe_equiv|s_cli_clientes|s_gen_recursos__conservador|subcontratado_sn|material_sn|s_art_articulos|tarifa|precio|cantidad|horas_invertidas|s_avi_urgencia|repetitivo_sn|descripcion_avisos|s_avi_incidencias_avisos|h24_sn|coincidente_sn|parametro_3|parametro_4|parametro_5|parametro_6|parametro_7|parametro_8|parametro_9|parametro_10|parametro_11|parametro_12|parametro_13|ppss|modelo_maniobra|accionamiento|cargas|velocidades_asc|paradas|n_viviendas|num_viviendas_planta|protocolo_placa|  f_puesta_en_marcha|modelo_esc|inclinacion|desnivel|longitud|ancho_peldano|velocidades|perfil|altura_balaustrada|funcionamiento|proteccion_intemperie|  antiguedad_anios|id_antiguedad_tramo|tipo_enlace|propiedad_enlace|codigo_postal|   fecha_proxima_ipo|    fecha_ultima_ipo|problematico_sn|tasa_avisos_anual|f_ita|ult_est_ipo|        f_ult_accion|intervalo_ipo|n_revisiones_teoricas|activo|cuarto_maquinas|embarques|tipo_cable|puertas|ft_luz|
+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+
|                 83107.0|         184.0|       251.0|         175.0|                15.0|    234.0|           1.0|         347.0|2010-07-06 00:00:...|       234.0|         null|        1151.0|                     8691.0|               N|          N|           null|  null|  null|    null|            null|             N|          N  |   NO ABRE PUERTAS|                    null|     N|             N|       X030|       ARCA|          H|         04|        063|         04|        B_1|     ORO-2M5|          06|        null|        null|X030|           ARCA|            H|    04|            063|     04|       null|                null|     ORONA 2005|2006-03-27 00:00:...|      null|       null|    null|    null|         null|        063|  null|              null|          null|                 null|10.559379853643966|                3.0|        OMU|             ORO|        20820|2018-03-15 00:00:...|2012-03-15 00:00:...|              N|              0.0| null|        <?>|2015-03-01 00:00:...|          6.0|                  9.0|     S|            <?>|ACC_2_180|      CONV|     TT|  null|
+------------------------+--------------+------------+--------------+--------------------+---------+--------------+--------------+--------------------+------------+-------------+--------------+---------------------------+----------------+-----------+---------------+------+------+--------+----------------+--------------+-------------+------------------+------------------------+------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+----+---------------+-------------+------+---------------+-------+-----------+--------------------+---------------+--------------------+----------+-----------+--------+--------+-------------+-----------+------+------------------+--------------+---------------------+------------------+-------------------+-----------+----------------+-------------+--------------------+--------------------+---------------+-----------------+-----+-----------+--------------------+-------------+---------------------+------+---------------+---------+----------+-------+------+

Next, I wanted to apply a StringIndexer like this:

%pyspark

indexer = StringIndexer(inputCol="parametro_3", outputCol="parametro_3indexed")
df = indexer.fit(df).transform(df)
df.show()

But I got the following error:

Py4JJavaError: An error occurred while calling o4450.showString.
: java.lang.NullPointerException
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o4450.showString.\n', JavaObject id=o4451), <traceback object at 0x20c0d40>)

My question is: why am I getting this error? I tried the following example and it works:

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()

What am I doing wrong? Or is there something wrong with my DataFrame?
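For context on where the NullPointerException likely comes from: `StringIndexer.fit` builds a label-to-index map from the distinct values of the input column, and (in Spark versions of this era, before `handleInvalid` gained a `'keep'` option) a `null` in that column surfaces as a `NullPointerException` on the JVM side during fitting or transform. The pure-Python sketch below is a hypothetical simplification of that indexing step, not Spark's actual implementation; the column name `parametro_3` is taken from the question, and the suggested pyspark remedy in the comments (`df.na.drop`) is an assumption about what would resolve the error, not a tested fix.

```python
from collections import Counter

def fit_string_indexer(column):
    """Mimic StringIndexer.fit: map each distinct label to an index,
    most frequent label first (a simplified sketch of Spark's logic)."""
    freq = Counter(column)
    # Sorting mixed None/str values raises TypeError in Python; on the JVM
    # side the analogous failure on a null label is a NullPointerException.
    labels = sorted(freq, key=lambda v: (-freq[v], v))
    return {label: idx for idx, label in enumerate(labels)}

clean = ["X030", "ARCA", "X030"]   # values like those in parametro_3
dirty = ["X030", None, "ARCA"]     # same column with a null mixed in

print(fit_string_indexer(clean))   # indexing succeeds on the clean column

try:
    fit_string_indexer(dirty)
except TypeError as exc:
    print("fit failed on a null label:", exc)

# In pyspark the analogous guard would be dropping (or filling) the nulls
# before fitting, e.g. df.na.drop(subset=["parametro_3"]) -- an assumption
# about what fixes the question's error, not a verified solution.
filtered = [v for v in dirty if v is not None]
print(fit_string_indexer(filtered))
```

The sample DataFrame in the question works precisely because its `category` column contains no nulls, while the wide `TABLE` DataFrame visibly has many `null` cells, so any null in `parametro_3` would trip the indexer.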

0 Answers:

There are no answers yet.