I am using Spark 2.2.0 with Python (PySpark) in a Jupyter Notebook. Here is the schema of the dataset used:
|-- srcIP: string (nullable = false)
|-- srcPort: string (nullable = false)
|-- dstIP: string (nullable = false)
|-- dstPort: string (nullable = false)
|-- taxonomy: string (nullable = false)
|-- heuristic: string (nullable = false)
|-- distance: string (nullable = false)
|-- label: string (nullable = true)
I use one StringIndexer per column, as follows:
labelIndexer = StringIndexer(inputCol='label', outputCol='label_ix').setHandleInvalid("keep")
srcIPIndexer = StringIndexer(inputCol='srcIP', outputCol='srcIP_ix').setHandleInvalid("keep")
dstIPIndexer = StringIndexer(inputCol='dstIP', outputCol='dstIP_ix').setHandleInvalid("keep")
taxonomyIndexer = StringIndexer(inputCol='taxonomy', outputCol='taxonomy_ix').setHandleInvalid("keep")
srcPortIndexer = StringIndexer(inputCol='srcPort', outputCol='srcPort_ix').setHandleInvalid("keep")
dstPortIndexer = StringIndexer(inputCol='dstPort', outputCol='dstPort_ix').setHandleInvalid("keep")
heuristicIndexer = StringIndexer(inputCol='heuristic', outputCol='heuristic_ix').setHandleInvalid("keep")
distanceIndexer = StringIndexer(inputCol='distance', outputCol='distance_ix').setHandleInvalid("keep")
output_fixed1 = labelIndexer.fit(Batchmawi).transform(Batchmawi)
output_fixed2 = srcIPIndexer.fit(output_fixed1).transform(output_fixed1)
output_fixed3 = dstIPIndexer.fit(output_fixed2).transform(output_fixed2)
output_fixed4 = taxonomyIndexer.fit(output_fixed3).transform(output_fixed3)
output_fixed5 = srcPortIndexer.fit(output_fixed4).transform(output_fixed4)
output_fixed6 = dstPortIndexer.fit(output_fixed5).transform(output_fixed5)
output_fixed7 = heuristicIndexer.fit(output_fixed6).transform(output_fixed6)
output_fixed8 = distanceIndexer.fit(output_fixed7).transform(output_fixed7)
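For reference, the same indexing could also be written as a single Pipeline (just a sketch using the same column names; the chained fit/transform above is what I actually ran):

from pyspark.ml import Pipeline
# One StringIndexer per categorical column, fitted together in a single Pipeline
index_cols = ['label', 'srcIP', 'dstIP', 'taxonomy', 'srcPort', 'dstPort', 'heuristic', 'distance']
indexers = [StringIndexer(inputCol=c, outputCol=c + '_ix').setHandleInvalid("keep") for c in index_cols]
output_fixed8 = Pipeline(stages=indexers).fit(Batchmawi).transform(Batchmawi)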
Then the VectorAssembler:
assembler = VectorAssembler(inputCols=['srcIP_ix', 'dstIP_ix', 'taxonomy_ix', 'srcPort_ix', 'dstPort_ix', 'heuristic_ix', 'distance_ix'], outputCol='features')
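For completeness, final_data used below is the result of applying the assembler to the indexed DataFrame; a minimal sketch, assuming output_fixed8 carries all the *_ix columns:

# Assemble the indexed columns into a single 'features' vector column
final_data = assembler.transform(output_fixed8)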
And finally I define a DecisionTree classifier:
model_dt = DecisionTreeClassifier(labelCol='label_ix', featuresCol='features', maxBins=286000)
The data is split into training and test sets:
train_data_np,test_data_np = final_data.randomSplit([0.07,0.03],seed = 1000)
print("Nombre Instances dans TRAIN_DATA ",train_data_np.count())
print("Nombre Instances dans Test_DATA ",test_data_np.count())
Fitting the training data to the decision tree model triggers an error:
print("Training ..")
Dt_Model = model_dt.fit(train_data_np)
print("Decision Tree..DONE ..")
u'requirement failed: DecisionTree requires maxBins (= 48999) to be at least as large as the number of values in each categorical feature, but categorical feature 0 has 285135 values. Considering remove this and other categorical features with a large number of values, or add more training examples.'
Note that maxBins is set to 286000, which should be large enough, yet I still get this error.
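To double-check how many distinct values each indexed feature actually has compared to maxBins, a quick count could look like this (a sketch, not part of the original job):

from pyspark.sql.functions import countDistinct
# Count distinct values of each indexed feature column to compare against maxBins
ix_cols = ['srcIP_ix', 'dstIP_ix', 'taxonomy_ix', 'srcPort_ix', 'dstPort_ix', 'heuristic_ix', 'distance_ix']
train_data_np.agg(*[countDistinct(c).alias(c) for c in ix_cols]).show()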
NB: I am running this on Azure HDInsight.