These are all the distinct values in the column I want to encode. state_msg is a string column.
df.groupBy('state_msg').count().show()
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  164790|
|  Canceled| 1063663|
|  Finished|36100201|
|Terminated|   12982|
|    Failed|  941183|
| Timed out| 5726363|
|     Error| 1957993|
|  Off-line|  186322|
| Not found|  592259|
+----------+--------+
I am trying to one-hot encode this column:
import pyspark.sql.functions as func
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='state_msg', outputCol='state_msg_index')
indexed_df = indexer.fit(df).transform(df)
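But I get the following exception, which makes no sense to me: according to the distinct values returned by the groupBy above, "1234567890" is not a possible value of state_msg.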
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "1234567890"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:583)
at java.lang.Integer.parseInt(Integer.java:615)
df.groupBy('state_msg').count().show(n=100)
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  165241|
|  Canceled| 1067515|
|  Finished|36270559|
|Terminated|   12997|
|    Failed|  944131|
| Timed out| 5745550|
|     Error| 1959041|
|  Off-line|  186899|
| Not found|  593823|
+----------+--------+
df.agg(func.countDistinct('state_msg').alias('count')).show()
+-----+
|count|
+-----+
|   10|
+-----+
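
For reference, here is a rough plain-Python sketch of the mapping I would expect the indexing step to produce, assuming StringIndexer's default ordering (most frequent label gets index 0). The counts are copied from the first groupBy output above; the helper names index_labels and one_hot are made up for illustration and are not part of PySpark. It does not use Spark at all, so it only shows the intended result, not why the exception occurs.

```python
# Counts copied from the first groupBy output in this post.
counts = {
    "Redirected": 28,
    "Busy": 164790,
    "Canceled": 1063663,
    "Finished": 36100201,
    "Terminated": 12982,
    "Failed": 941183,
    "Timed out": 5726363,
    "Error": 1957993,
    "Off-line": 186322,
    "Not found": 592259,
}


def index_labels(label_counts):
    """Hypothetical helper: assign 0..n-1 by descending frequency.

    Ties (none in this data) are broken alphabetically, which is an
    assumption made only to keep the result deterministic.
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Hypothetical helper: full-width one-hot vector as a list of 0/1."""
    return [1 if i == index else 0 for i in range(size)]


label_to_index = index_labels(counts)

for label, idx in sorted(label_to_index.items(), key=lambda item: item[1]):
    print(idx, label, one_hot(idx, len(counts)))
```

Note that this sketch keeps one slot per label, whereas Spark's OneHotEncoder drops the last category by default (dropLast=True), so the real vectors would be one element shorter.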