I am using Spark 2.2.0 with Python (PySpark) in a Jupyter Notebook. Here is the schema of the dataset used:
|-- srcIP: string (nullable = false)
|-- srcPort: string (nullable = false)
|-- dstIP: string (nullable = false)
|-- dstPort: string (nullable = false)
|-- taxonomy: string (nullable = false)
|-- heuristic: string (nullable = false)
|-- distance: string (nullable = false)
|-- label: string (nullable = true)
I use one StringIndexer per column, as follows:
labelIndexer = StringIndexer(inputCol='label', outputCol='label_ix').setHandleInvalid("keep")
srcIPIndexer = StringIndexer(inputCol='srcIP', outputCol='srcIP_ix').setHandleInvalid("keep")
dstIPIndexer = StringIndexer(inputCol='dstIP', outputCol='dstIP_ix').setHandleInvalid("keep")
taxonomyIndexer = StringIndexer(inputCol='taxonomy', outputCol='taxonomy_ix').setHandleInvalid("keep")
srcPortIndexer = StringIndexer(inputCol='srcPort', outputCol='srcPort_ix').setHandleInvalid("keep")
dstPortIndexer = StringIndexer(inputCol='dstPort', outputCol='dstPort_ix').setHandleInvalid("keep")
heuristicIndexer = StringIndexer(inputCol='heuristic', outputCol='heuristic_ix').setHandleInvalid("keep")
distanceIndexer = StringIndexer(inputCol='distance', outputCol='distance_ix').setHandleInvalid("keep")
output_fixed1 = labelIndexer.fit(Batchmawi).transform(Batchmawi)
output_fixed2 = srcIPIndexer.fit(output_fixed1).transform(output_fixed1)
output_fixed3 = dstIPIndexer.fit(output_fixed2).transform(output_fixed2)
output_fixed4 = taxonomyIndexer.fit(output_fixed3).transform(output_fixed3)
output_fixed5 = srcPortIndexer.fit(output_fixed4).transform(output_fixed4)
output_fixed6 = dstPortIndexer.fit(output_fixed5).transform(output_fixed5)
output_fixed7 = heuristicIndexer.fit(output_fixed6).transform(output_fixed6)
output_fixed8 = distanceIndexer.fit(output_fixed7).transform(output_fixed7)
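For reference, the same indexing could also be written as a single Pipeline (just a sketch using the same column names; the chained fit/transform above is what I actually ran):

from pyspark.ml import Pipeline
# One StringIndexer per categorical column, fitted together in a single Pipeline
index_cols = ['label', 'srcIP', 'dstIP', 'taxonomy', 'srcPort', 'dstPort', 'heuristic', 'distance']
indexers = [StringIndexer(inputCol=c, outputCol=c + '_ix').setHandleInvalid("keep") for c in index_cols]
output_fixed8 = Pipeline(stages=indexers).fit(Batchmawi).transform(Batchmawi)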
Then the VectorAssembler:
assembler = VectorAssembler(inputCols=['srcIP_ix', 'dstIP_ix', 'taxonomy_ix', 'srcPort_ix', 'dstPort_ix', 'heuristic_ix', 'distance_ix'], outputCol='features')
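For completeness, final_data used below is the result of applying the assembler to the indexed DataFrame; a minimal sketch, assuming output_fixed8 carries all the *_ix columns:

# Assemble the indexed columns into a single 'features' vector column
final_data = assembler.transform(output_fixed8)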
And finally I define a DecisionTree classifier:
model_dt = DecisionTreeClassifier(labelCol='label_ix', featuresCol='features', maxBins=286000)
The data is split into training and test sets:
train_data_np,test_data_np = final_data.randomSplit([0.07,0.03],seed = 1000)
print("Nombre Instances dans TRAIN_DATA ",train_data_np.count())
print("Nombre Instances dans Test_DATA ",test_data_np.count())
Fitting the training data to the decision tree model triggers an error:
print("Training ..")
Dt_Model = model_dt.fit(train_data_np)
print("Decision Tree..DONE ..")
u'requirement failed: DecisionTree requires maxBins (= 48999) to be at least as large as the number of values in each categorical feature, but categorical feature 0 has 285135 values. Considering remove this and other categorical features with a large number of values, or add more training examples.'
Note that maxBins is set to 286000, which should be large enough, yet I still get this error.
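To double-check how many distinct values each indexed feature actually has compared to maxBins, a quick count could look like this (a sketch, not part of the original job):

from pyspark.sql.functions import countDistinct
# Count distinct values of each indexed feature column to compare against maxBins
ix_cols = ['srcIP_ix', 'dstIP_ix', 'taxonomy_ix', 'srcPort_ix', 'dstPort_ix', 'heuristic_ix', 'distance_ix']
train_data_np.agg(*[countDistinct(c).alias(c) for c in ix_cols]).show()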
NB: I am running this on Azure HDInsight.