StringIndexer NumberFormatException for a value that does not appear in the column

Time: 2018-07-20 16:12:41

Tags: java python apache-spark pyspark

These are all the distinct values in the column I want to encode. The column state_msg is of type string.

df.groupBy('state_msg').count().show()
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  164790|
|  Canceled| 1063663|
|  Finished|36100201|
|Terminated|   12982|
|    Failed|  941183|
| Timed out| 5726363|
|     Error| 1957993|
|  Off-line|  186322|
| Not found|  592259|
+----------+--------+

I am trying to one-hot encode this column:

import pyspark.sql.functions as func

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='state_msg', outputCol='state_msg_index')
indexed_df = indexer.fit(df).transform(df)

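But I get the exception below, which makes no sense to me: according to the distinct values produced by the groupBy above, "1234567890" is not a possible value in state_msg.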

    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "1234567890"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)

df.groupBy('state_msg').count().show(n=100)
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  165241|
|  Canceled| 1067515|
|  Finished|36270559|
|Terminated|   12997|
|    Failed|  944131|
| Timed out| 5745550|
|     Error| 1959041|
|  Off-line|  186899|
| Not found|  593823|
+----------+--------+

df.agg(func.countDistinct('state_msg').alias('count')).show()

+-----+
|count|
+-----+
|   10|
+-----+
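
Since none of the checks above show "1234567890" in state_msg, one way to narrow this down might be to look for that literal in the other string columns of df. The following is only a minimal sketch under stated assumptions: string_columns and count_matches are hypothetical helper names, not part of PySpark, and the column names are assumed to contain no dots or other special characters.

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is 'string'.

    `dtypes` is a list of (name, type) pairs, as returned by df.dtypes.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def count_matches(df, needle):
    """Count, per string column, the rows whose value equals `needle`.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame
    whose column names contain no dots or other special characters.
    """
    from pyspark.sql import functions as F  # imported lazily; needs pyspark

    cols = string_columns(df.dtypes)
    if not cols:
        return {}
    # One pass over the data: sum a 0/1 flag per string column.
    row = df.agg(*[
        F.sum((F.col(c) == needle).cast("int")).alias(c) for c in cols
    ]).first()
    # sum() is null when every compared value is null, so map None to 0.
    return {c: (row[c] or 0) for c in cols}
```

Calling count_matches(df, "1234567890") returns a dict mapping each string column to the number of matching rows. Because Spark evaluates lazily, if this scan raises the same NumberFormatException, that would suggest the failure happens while reading or parsing the input data rather than inside StringIndexer itself.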

0 answers:

No answers yet