I am trying to build a random forest from a CSV file. My code so far:
CSV_PATH = "C:/Users/xxxxx/Documents/tutorials/features_cut.csv"
APP_NAME = "Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 13579
TRAINING_DATA_RATIO = 0.7
RF_NUM_TREES = 3
RF_MAX_DEPTH = 4
RF_NUM_BINS = 32
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName(APP_NAME) \
    .master(SPARK_URL) \
    .getOrCreate()

data = spark.read \
    .options(header="true", inferSchema="true") \
    .csv(CSV_PATH)
print("Total number of rows: %d" % df.count()) #470121
All of the values in the DataFrame are strings.
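(To check which columns were actually inferred as strings, the schema can be inspected directly; this is just the standard DataFrame API, nothing specific to my data:)

data.printSchema()  # columns inferred as text are listed with type 'string'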
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Treat the last column as the label and the rest as features.
# Vectors.dense() casts every value to float, so string columns break here.
transformed_data = data.rdd.map(lambda x: LabeledPoint(x[-1], Vectors.dense(x[0:-1])))
splits = [TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO]
training_data, test_data = transformed_data.randomSplit(splits, RANDOM_SEED)
This is where the ERROR occurs:
print("Number of training set rows: %s" % training_data.count())
print("Number of test set rows: %s " % test_data.count())
#Value Error: could not convert string to float 'PLOT'
I can't understand how the rdd.map function works. I would appreciate your help.
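If I understand correctly, rdd.map just applies the given function once per element of the RDD; the ValueError seems to come from Vectors.dense() casting each value to float, not from map itself. A minimal sketch with made-up numeric rows (purely illustrative data, reusing the imports above) that shows the same mapping succeed when all values are numeric:

# map applies the lambda to each element; x is one tuple per call.
# With numeric values, the float conversion inside Vectors.dense succeeds.
rdd = spark.sparkContext.parallelize([(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)])
labeled = rdd.map(lambda x: LabeledPoint(x[-1], Vectors.dense(x[0:-1])))
print(labeled.collect())
# [LabeledPoint(0.0, [1.0,2.0]), LabeledPoint(1.0, [3.0,4.0])]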
Update:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").setHandleInvalid("keep")
            for column in data.columns]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(data).transform(data)
With the help of StringIndexer and Pipeline I was able to convert my strings to doubles.
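Building on that, the remaining step is presumably to assemble the indexed columns into a single feature vector and train the forest with the DataFrame-based spark.ml API instead of mllib. A minimal sketch, assuming the label is the last CSV column and that df_r contains the *_index columns produced above (the derived column names are my assumptions, not from the original data):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Indexed label and feature column names, derived from the original columns
# (assumption: every column was indexed by the Pipeline above).
label_col = data.columns[-1] + "_index"
feature_cols = [c + "_index" for c in data.columns[:-1]]

# Pack the indexed feature columns into one vector column named "features".
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df_r)

# DataFrame randomSplit replaces the earlier RDD-based split.
train, test = assembled.randomSplit([TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO],
                                    seed=RANDOM_SEED)

rf = RandomForestClassifier(labelCol=label_col, featuresCol="features",
                            numTrees=RF_NUM_TREES, maxDepth=RF_MAX_DEPTH,
                            maxBins=RF_NUM_BINS, seed=RANDOM_SEED)
model = rf.fit(train)
predictions = model.transform(test)

model.transform(test) adds a prediction column, which could then be scored with, for example, MulticlassClassificationEvaluator.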