I'm working through a PySpark machine learning tutorial.
I ran into a problem when I reached the "Correlations and Data Preparation" section.
I'm trying to run this code:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

# Map the Yes/No and True/False string values to doubles
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

# Drop unused columns and binarize the categorical columns
CV_data = CV_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()

# Apply the same transformation to the test set
final_test_data = final_test_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(final_test_data['Churn'])) \
    .withColumn('International plan', toNum(final_test_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()
Here is (part of) the error message printed to the terminal:
17/06/20 17:58:53 WARN BlockManager: Putting block rdd_38_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_38_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 WARN BlockManager: Putting block rdd_53_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_53_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 16)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 1, in <lambda>
KeyError: False

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    ....
The rest of the error message can be seen in this document here.
Does anyone know what the problem is?
Thanks in advance.
Answer 0 (score: 1)
[SOLVED]
I fixed this after referring to this thread from 2 months back.
The main problem is what @user6910411 pointed out above: it is a data type error. The 'Churn' column holds real booleans (True/False), not the strings 'True'/'False', so the lookup binary_map[False] raises the KeyError shown above.
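For context, here is a minimal sketch of the mismatch, assuming the CSV reader inferred 'Churn' as a boolean column (the extra boolean keys are my addition, not the tutorial's):

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

# The original map has only string keys, but the UDF receives real Python
# booleans from the boolean-typed 'Churn' column, so binary_map[False]
# raises KeyError: False. Adding boolean keys makes the lookup succeed.
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0,
              True: 1.0, False: 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())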
Since I didn't need all of the data printed as numbers, I simply excluded the last 3 lines of the tutorial site's code for both CV_data and final_test_data:
Excluded from CV_data:
.withColumn('Churn', toNum(CV_data['Churn'])) \
.withColumn('International plan', toNum(CV_data['International plan'])) \
.withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()
Excluded from final_test_data:
.withColumn('Churn', toNum(final_test_data['Churn'])) \
.withColumn('International plan', toNum(final_test_data['International plan'])) \
.withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()
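If you do need the numeric columns, an alternative sketch (my own, not from the tutorial or the thread) is to cast the boolean 'Churn' column directly and keep the UDF only for the genuinely string-valued Yes/No columns:

# Cast the boolean column to double (True -> 1.0, False -> 0.0) instead of
# pushing it through the string-keyed UDF; the Yes/No columns still use toNum.
CV_data = CV_data \
    .withColumn('Churn', CV_data['Churn'].cast(DoubleType())) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()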
With those lines excluded, the table prints out:
>>> pd.DataFrame(CV_data.take(5), columns=CV_data.columns).transpose()
17/06/21 13:49:54 WARN Executor: 1 block locks were not released by TID = 11:
[rdd_16_0]
                            0      1      2      3      4
Account length            128    107    137     84     75
International plan         No     No     No    Yes    Yes
Voice mail plan           Yes    Yes     No     No     No
Number vmail messages      25     26      0      0      0
Total day minutes       265.1  161.6  243.4  299.4  166.7
Total day calls           110    123    114     71    113
Total eve minutes       197.4  195.5  121.2   61.9  148.3
Total eve calls            99    103    110     88    122
Total night minutes     244.7  254.4  162.6  196.9  186.9
Total night calls          91    103    104     89    121
Total intl minutes         10   13.7   12.2    6.6   10.1
Total intl calls            3      3      5      7      3
Customer service calls      1      1      0      2      3
Churn                   False  False  False  False  False
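As a quick sanity check (my addition), printSchema() shows the inferred column types; seeing 'Churn' listed as boolean here confirms why the string-keyed map failed:

>>> CV_data.printSchema()  # look for: Churn: boolean (nullable = true)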