I'm trying to split my data into a training set and a test set, but it isn't working: the returned dataset appears to contain only zeros. How can I fix this?
Note: my data has many more samples; for clarity I'm only showing the first four.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

data = pd.read_csv('path')
header = data.columns.tolist()
fn = header[:-1]  # every column except the last one ("Type"), which is the label
assembler = VectorAssembler(inputCols=fn, outputCol="features")
spDF = spark.createDataFrame(data)
spDF.show()
Output:
+---+---+---+----+---+---+------+----+---+---+----+
| a| b| c| d| e| f| g| h| i| j|Type|
+---+---+---+----+---+---+------+----+---+---+----+
| 7| 0| 2| 700| 9| 10| 1153| 832| 9| 2| 1|
| 17| 7| 4|1230| 17| 19| 1265|1230| 17| 0| 0|
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 31| 22| 3|3812| 39| 37| 18784|4380| 39| 8| 0|
+---+---+---+----+---+---+------+----+---+---+----+
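Since only the first four rows are shown here, a quick way to see what the full DataFrame contains before splitting (assuming the feature columns in fn are non-negative, as they are above) would be something like:
from pyspark.sql import functions as F

print(spDF.count())                          # total number of rows
spDF.groupBy("Type").count().show()          # label distribution over the full data

# rows whose feature columns are all zero (works because the features are non-negative)
all_zero = spDF.filter(sum(F.col(c) for c in fn) == 0)
print(all_zero.count())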
spDF = assembler.transform(spDF)
spDF.show()
Output:
+---+---+---+----+---+---+------+----+---+---+----+--------------------+
| a| b| c| d| e| f| g| h| i| j|Type| features|
+---+---+---+----+---+---+------+----+---+---+----+--------------------+
| 7| 0| 2| 700| 9| 10| 1153| 832| 9| 2| 1|[7.0,0.0,2.0,700....|
| 17| 7| 4|1230| 17| 19| 1265|1230| 17| 0| 0|[17.0,7.0,4.0,123...|
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| (10,[],[])|
| 31| 22| 3|3812| 39| 37| 18784|4380| 39| 8| 0|[31.0,22.0,3.0,38...|
+---+---+---+----+---+---+------+----+---+---+----+--------------------+
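As far as I understand, (10,[],[]) in the features column is not an error by itself: it is how Spark prints a SparseVector of size 10 with no non-zero entries, i.e. an all-zero feature vector. For example:
from pyspark.ml.linalg import Vectors

v = Vectors.sparse(10, [], [])   # size 10, no indices, no values
print(v.toArray())               # ten 0.0 values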
(train_data, test_data) = spDF.randomSplit([0.8,0.2])
train_data.show()
Output:
+---+---+---+---+---+---+---+---+---+---+----+----------+
| a| b| c| d| e| f| g| h| i| j|Type| features|
+---+---+---+---+---+---+---+---+---+---+----+----------+
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|(10,[],[])|
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|(10,[],[])|
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|(10,[],[])|
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|(10,[],[])|
+---+---+---+---+---+---+---+---+---+---+----+----------+
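A minimal sketch of how I could check the split itself (the seed value is arbitrary, just to make the split reproducible; show() only prints the first 20 rows and does not guarantee any particular order):
train_data, test_data = spDF.randomSplit([0.8, 0.2], seed=42)

print(train_data.count(), test_data.count())   # sizes of the two splits
train_data.groupBy("Type").count().show()      # label distribution in the training split
test_data.groupBy("Type").count().show()       # label distribution in the test split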
I want to pass this training data to fit a machine learning model.
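For reference, what I ultimately want to do with train_data is roughly the following (LogisticRegression is just a placeholder for whichever model I end up using, and I'm assuming "Type" is the label column):
from pyspark.ml.classification import LogisticRegression

# "features" is the vector column produced by the VectorAssembler above
lr = LogisticRegression(featuresCol="features", labelCol="Type")
model = lr.fit(train_data)

predictions = model.transform(test_data)
predictions.select("Type", "prediction").show()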